Strategies for archiving the Danish web space Bjarne Andersen Head of Digital Resources State and University Library, Aarhus


1 Strategies for archiving the Danish web space
Bjarne Andersen, Head of Digital Resources
State and University Library, Aarhus
bja@netarkivet.dk
http://netarchive.dk

2 Agenda
- New legal deposit law in Denmark
- Collection strategies
- NetarchiveSuite software package
- Snapshot harvesting
- Selective harvesting
- Event harvesting
- Challenges in snapshot harvesting
- Snapshot harvesting usefulness
- Future work

3 Legal deposit law 1
- Revision of the legal deposit law in 1997: legal deposit now included static documents on the internet
- During 1998-1999 we found out that:
  - We were actually preserving the least interesting part
    - Many of the documents in that collection are also available in print
- A lot of work was done between 2000 and 2004
  - 2 pilot projects run by the two national libraries
    - Testing different software / different strategies for archiving / storing web material
  - A governmental publication on "preserving the Danish digital cultural heritage" (2003)
  - A report to the Ministry of Culture (2004) outlining
    - Recommendations from the two national libraries on how to solve the "entire" problem
    - Issues to be covered by a new revision of the legal deposit law

4 Legal deposit law 2
- A new revision came into force on July 1st 2005
  - Allowing the two national libraries to automatically gather all Danish websites
  - "Danish" roughly defined as:
    - Websites on the .dk TLD
    - Websites aimed at a Danish audience / written in Danish
    - Websites about Danish people (e.g. Hans Christian Andersen)
    - More or less any site of interest to Denmark
  - We are by law granted access to all relevant data from the .dk TLD administrator
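
The rough definition of "Danish" above can be read as a scoping rule for a crawler. The following is only a minimal sketch of such a rule, not the logic Netarchive.dk actually uses: the function names, the `language` argument (assumed to come from a separate language detector) and the `CURATED_DOMAINS` allow-list are all hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical curator-maintained list of non-.dk domains "of interest to Denmark".
CURATED_DOMAINS = {"example-danish-author.org"}


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def is_danish_candidate(url: str, language: Optional[str] = None) -> bool:
    """Rough scope test following the slide's three criteria.

    `language` is an assumed ISO 639-1 code supplied by some external
    language detector; None means "unknown".
    """
    host = host_of(url)
    # Criterion 1: the site lives on the .dk TLD.
    if host.endswith(".dk"):
        return True
    # Criterion 2: the content is written in Danish.
    if language == "da":
        return True
    # Criteria 3 and 4: about Danish people / of interest to Denmark,
    # approximated here by a manually curated list of domains.
    return any(host == d or host.endswith("." + d) for d in CURATED_DOMAINS)
```

For example, `is_danish_candidate("http://netarchive.dk/")` is true on the TLD rule alone, while a `.com` site only qualifies if it is detected as Danish or has been added to the curated list.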

5 Legal deposit law 3
- The law covers all publicly available material
  - Material that all Danish people can in principle gain access to
    - Including material which requires action before usage (payment, registration...)
    - Pay-sites should hand out username / password upon request (for free)
- Other interesting parts
  - Combined strategy (snapshot, selective and event harvesting)
  - Robots.txt explicitly mentioned in the regulations of the law
    - A lot of the very interesting websites have very restrictive robots.txt files (we discovered around 35.000 robots.txt files)
    - During 6 snapshots of more than 750.000 web sites we had fewer than 50 complaints about robots.txt
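
To illustrate what a "very restrictive" robots.txt means for a crawler that obeys it, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content, the user-agent string and the URLs are made-up examples; the sketch says nothing about how the Netarchive.dk harvester itself treats these files.

```python
from urllib.robotparser import RobotFileParser

# A made-up example of a maximally restrictive robots.txt.
RESTRICTIVE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-compliant crawler may fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # access is needed for this illustration.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A compliant crawler would skip the whole site under the restrictive file,
# while an empty robots.txt imposes no restriction at all.
blocked = not allowed_by_robots(
    RESTRICTIVE_ROBOTS_TXT, "example-archive-bot", "http://example.dk/news/"
)
unrestricted = allowed_by_robots(
    "", "example-archive-bot", "http://example.dk/news/"
)
```

A site with a file like this would be entirely missing from an archive built by a strictly compliant crawler, which is presumably why the regulations address robots.txt explicitly.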

6 Legal deposit law 4
- In the end this led to the funding of Netarchive.dk
- Virtual centre in cooperation between
  - The Royal Library, Copenhagen
  - The State & University Library, Aarhus
- Implementing a complete system
- Running the archiving on a daily basis
  - Currently with an annual budget of 450.000 euros
- Involving 15 people from the two libraries
  - 4.5 man-years of work

7 The 3 collection strategies
- Illustrated by coverage over time
- Amount of data collected so far
  - Snapshots: 61 TB (6 times)
  - Selective harvests: 9.5 TB (80 web sites)
  - Event harvests: 5.6 TB (9 events)
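
The volumes above can be put side by side with a few lines of arithmetic. This is just a worked calculation on the figures from the slide; the variable names are illustrative.

```python
# Figures taken from the slide (terabytes collected so far).
collected_tb = {"snapshot": 61.0, "selective": 9.5, "event": 5.6}

# Total archive size across the three strategies.
total_tb = sum(collected_tb.values())

# Share of the archive contributed by each strategy.
share = {name: tb / total_tb for name, tb in collected_tb.items()}

# Average size of one snapshot harvest (6 snapshots so far).
avg_snapshot_tb = collected_tb["snapshot"] / 6

# Average volume per selectively harvested site (80 sites).
avg_selective_site_tb = collected_tb["selective"] / 80
```

The total comes to 76.1 TB, of which snapshots make up roughly 80%, at an average of about 10.2 TB per snapshot.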

8 NetarchiveSuite software package
- We needed a curator tool ready by July 1st 2005
  - Requirement number 1: operated by librarians
- With the web interface librarians can:
  - Define harvests (all three types)
    - Based on quite simple settings plus a number of different predefined Heritrix setups
  - Do quality control
    - Looking at harvest results (simple reports and statistics)
    - Browsing through harvested material
  - Trigger automated pickup of missing URIs
- NetarchiveSuite was released as Open Source in July 2007
  - Currently used by a number of national libraries

9 Snapshot harvesting
- The .dk TLD currently holds > 750.000 active domains
  - We encountered around 42.000 Danish domains outside the .dk TLD
    - By extracting links from the entire .dk web space and checking the country code by IP number (GeoIP)
    - By doing Google searches on Danish localities (city names...)
- With 8 machines we can do
  - One complete snapshot (including deduplication) at 20 TB in 80 days
  - Deduplication saves around 30% of the storage space
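
The slide does not say how deduplication is implemented. A common approach, sketched below purely as an assumption, is to keep a content digest per URL from earlier harvests and skip storing a payload whose digest has not changed. The function name and the `(url, digest)` key are illustrative choices, not NetarchiveSuite's API.

```python
import hashlib


def deduplicate(records, seen):
    """Store only payloads not already archived under the same URL.

    records: iterable of (url, payload_bytes) pairs from the current harvest.
    seen:    set of (url, sha1_hex) pairs from earlier harvests; updated in place.
    Returns (stored, bytes_saved), where stored lists the (url, digest) pairs
    that would be written and bytes_saved counts the skipped payload bytes.
    """
    stored = []
    bytes_saved = 0
    for url, payload in records:
        digest = hashlib.sha1(payload).hexdigest()
        key = (url, digest)
        if key in seen:
            # Unchanged since an earlier harvest: record nothing new.
            bytes_saved += len(payload)
            continue
        seen.add(key)
        stored.append(key)
    return stored, bytes_saved


# Two consecutive "snapshots" of two tiny example pages.
index = set()
first_stored, first_saved = deduplicate(
    [("http://a.example.dk/", b"hello"), ("http://b.example.dk/", b"world")], index
)
second_stored, second_saved = deduplicate(
    [("http://a.example.dk/", b"hello"), ("http://b.example.dk/", b"changed")], index
)
```

In the second pass only the changed page is stored. As for throughput, taking 1 TB as 1000 GB, 20 TB in 80 days works out to 250 GB per day across the 8 machines, or about 31 GB per machine per day.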

10 Selective harvesting
- Archiving of 80 selected websites
  - News sites
  - "Typical" dynamic and heavily used sites representing civic society, the commercial sector and public authorities
  - Experimental and/or unique sites, documenting new ways of using the web (e.g. net art)
  - Harvested much more frequently
    - From weekly to several times per day

11 Event harvesting
- Combining the other two strategies
  - Taking a larger number of sites (200-3000)
  - On a more frequent basis (daily / weekly)
  - In a shorter period of time
- We have done 9 event harvests so far
  - Elections, different national events
- We have pre-defined some harvest definitions, especially for news sites (both local and national)
  - With one click we can start these if a sudden event should happen, to ensure collection of important sites from the very beginning

12 Challenges in snapshot harvesting
- Number of domains is constantly growing
  - 2005: 607.000 domains, 480.000 active
  - 2008: 950.000 domains, 750.000 active
- Domains are growing bigger and bigger
  - Audio/video is getting more and more popular
  - Sites larger than 10 MB increased from 40.000 to 90.000
  - Sites larger than 500 MB increased from 6.000 to 12.000
- Web 2.0 makes harvesting difficult
  - Web material is inlined from other web sites, from all over the world
    - The border of a web site is disappearing
  - The web is becoming more and more dynamic (Flash / Ajax)
- The amount of traps and spam grows constantly
  - In Denmark librarians manually inspect all websites larger than 1 GB
    - Currently over 3000 domains
    - They identify aliases and potential crawler traps
  - That task should be (semi-)automated
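
As a rough idea of what "(semi-)automated" inspection could look like, the sketch below selects domains above the 1 GB threshold and applies two simple URL heuristics often associated with crawler traps: very deep paths and the same path segment repeated many times. The thresholds and function names are hypothetical, and this is not the procedure the Danish librarians follow.

```python
from collections import Counter
from urllib.parse import urlsplit

SIZE_THRESHOLD_BYTES = 1_000_000_000  # 1 GB as on the slide (decimal GB assumed)
MAX_PATH_DEPTH = 15                   # hypothetical cut-off
MAX_SEGMENT_REPEATS = 3               # hypothetical cut-off


def domains_to_inspect(domain_sizes):
    """Return the domains whose harvested size exceeds the threshold, sorted by name."""
    return sorted(d for d, size in domain_sizes.items() if size > SIZE_THRESHOLD_BYTES)


def looks_like_trap(url):
    """Flag URLs with very deep paths or a path segment repeated many times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_PATH_DEPTH:
        return True
    repeats = Counter(segments)
    return any(count >= MAX_SEGMENT_REPEATS for count in repeats.values())


def suspicious_urls(urls):
    """Return the subset of URLs a librarian might want to look at first."""
    return [u for u in urls if looks_like_trap(u)]
```

Heuristics like these can only pre-sort candidates for a human to review. An endless calendar at `/cal/next/next/next/` would be flagged, but so might a legitimate deep directory tree.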

13 Snapshot harvesting usefulness
- With snapshot harvesting a web archive preserves cultural heritage by
  - Archiving regular "pictures" of entire national parts of the internet
  - Archiving as much as possible in a quite cheap way
    - Netarchive.dk: storage space and 15 hours per week for librarians
- Snapshots are very useful for research in many different areas
  - Linguistics
  - Web technologies
  - File formats and their evolution
  - Web design
  - Genealogy / ancestor search
  - Web site history
  - And many more, to be defined in the future
- And of course useful for more ordinary users wanting
  - To find content that has disappeared from the live web (40-100 days lifetime)
    - Getting more and more interesting over time
- Currently access to Netarchive.dk is limited to researchers

14 Future work
- Automating discovery of Danish web sites outside the .dk TLD
- Automated quality assurance for large crawls
- Automating filtering of web spam and traps
- Improving archiving of web 2.0
  - Dynamic web content
  - Streaming audio/video
- None of these problems are specifically Danish
  - Let's solve them together
  - LIWA: European project working on most of these problems
- Danish challenges
  - Working for better access possibilities
    - On the system level: Wayback Machine / NutchWAX search
    - On the political level: change of law

15 Questions?

