
Presentation on theme: " Kulturarw³ Capturing the web The Swedish experience"— Presentation transcript:

1 Kulturarw³ Capturing the web The Swedish experience

2 Content Background Kulturarw³ –goals –strategy –Sweden on the net? Harvesting –Software –Finding links –Problems Statistics –What have we got? The Archive –priorities –storage –what we save Development –IIPC –Tools, formats Conclusion

3 Background Legal deposit, 1661 Latest revision 1993 –Only electronic documents in fixed form –CD-ROM, diskettes New law –July 1st, 2002, exception from personal privacy law. First Swedish web newspaper lost – Printed newspapers since 1645 Kulturarw³ started 1996 Still waiting for new legal deposit law

4 Goals All web pages in Sweden –pictures, video etc. –.se and other top-level domains –Electronic journals

5 Strategy: two choices Select what is important How to know what will be considered important in the future? Labour-intensive Everything using automatic software Gets everything (well, not really) Less labour-intensive

6 Strategy Take snapshots of the Swedish web a few times each year –Gets all –Needs less labour –Computer memory is cheap –However, large volumes make quality control difficult Selective harvesting of about 150 newspapers every day In the future: events, e.g. elections With as little human intervention as possible.

7 Sweden on the web? .se: only the domain part is relevant .nu (Niue): popular in Sweden, since "nu" means "now" in Swedish Others: if the server is geographically located in Sweden Language?
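
The scope rules on this slide could be sketched roughly as follows. This is a minimal illustration, not the selection logic the project actually runs: the set of top-level domains is an assumption (.se from the goals slide, .nu from this one), and `server_country` is a hypothetical country code that the caller has already looked up for the host, for example from a GeoIP database.

```python
from urllib.parse import urlsplit

# Assumed scope rules, loosely following the slide: .se and .nu hosts are
# in scope by domain; any other host only if its server is located in Sweden.
SWEDISH_TLDS = {"se", "nu"}


def in_scope(url, server_country=None):
    """Return True if the URL belongs to the (assumed) Swedish web.

    server_country is a hypothetical ISO country code supplied by the
    caller; this sketch does not perform the lookup itself.
    """
    host = (urlsplit(url).hostname or "").rstrip(".").lower()
    tld = host.rsplit(".", 1)[-1]
    if tld in SWEDISH_TLDS:
        return True  # only the domain part matters, the path is ignored
    return server_country == "SE"
```

Language detection, the open question on the slide, is deliberately left out of the sketch.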

8 Harvesting software A harvester (crawler, spider) collects web pages by automatically following links and saving pages Open-source harvester: Heritrix -Main developer: Internet Archive (IA) -Written in Java. Active community. -Designed for archiving, not indexing. Earlier: Modified version of Combine -From NetLab, Lund University. Important! Indexing isn't archiving and archiving isn't indexing! Also collects pictures, sound etc.
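
The basic harvesting loop (follow links, save what you fetch) can be illustrated with a toy sketch. This is not how Heritrix is implemented; it only shows the idea. The `fetch` callable is a hypothetical stand-in for an HTTP client, and the sketch omits everything a real harvester needs, such as politeness delays, robots.txt handling and scope rules.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect link targets: href for navigation, src for inline objects."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def harvest(seeds, fetch, max_pages=100):
    """Breadth-first harvest starting from the seed URLs.

    fetch(url) is a hypothetical callable that returns the body as a
    string, or None on failure.
    """
    queue, seen, saved = deque(seeds), set(seeds), {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        saved[url] = body  # "archive" the page exactly as fetched
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            absolute = urldefrag(urljoin(url, link)).url  # drop #fragments
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return saved
```

Following `src` attributes as well as `href` is what lets the sketch pick up inline pictures and not just pages.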

9 Problems …or challenges if you are an optimist… Scripts Interactive pages Password protected Video/streaming material Social sites

10 Statistics – what did we get? Bulk crawls (everything Swedish) First sweep, 1997: 6.8 million files, 160 GB of data A sweep in 2007-2008, .se and other TLDs: 270 million files, 11,500 GB of data
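
A quick back-of-the-envelope calculation on these figures shows how the average file size and the total volume changed between the two sweeps. It assumes that 1 GB on the slide means 10**9 bytes, which the slide does not state.

```python
# Figures from the slide; 1 GB is assumed to mean 10**9 bytes.
sweeps = {
    "1997": {"files": 6.8e6, "gigabytes": 160},
    "2007-2008": {"files": 270e6, "gigabytes": 11500},
}


def average_file_kb(sweep):
    """Average size per file in kilobytes (10**3 bytes)."""
    return sweep["gigabytes"] * 1e9 / sweep["files"] / 1e3


for name, sweep in sweeps.items():
    print(f"{name}: {average_file_kb(sweep):.1f} kB per file")

first, latest = sweeps["1997"], sweeps["2007-2008"]
print(f"files grew {latest['files'] / first['files']:.0f}x")
print(f"data grew {latest['gigabytes'] / first['gigabytes']:.0f}x")
```

Under that assumption the average file grows from about 23.5 kB to about 42.6 kB, so the data volume grew faster than the number of files.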

11 Statistics – what did we get? Periodika (newspapers) Started June 2002 88 million URLs 4.0 TB About 40 000 URLs every day

12 More statistics Bulk (everything Swedish) 823 100 web servers (including inlines) 651 700 Swedish 50 % 21% - others 29% 1549 different MIME types found. –HTML about 50% –text/html + image/gif + image/jpeg + application/pdf + text/plain about 97% of the documents. –A lot of garbage, misspellings etc.
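
A MIME-type breakdown like this one can be produced by tallying the Content-Type values recorded at harvest time. The sketch below is one plausible way to do it, not the project's own tooling; the function names are illustrative.

```python
from collections import Counter


def normalise_mime(header_value):
    """Reduce a Content-Type header to a lowercase 'type/subtype' string."""
    return header_value.split(";", 1)[0].strip().lower()


def mime_shares(header_values):
    """Return (mime_type, share) pairs, most common first."""
    counts = Counter(normalise_mime(v) for v in header_values)
    total = sum(counts.values())
    return [(mime, n / total) for mime, n in counts.most_common()]
```

Normalising case and stripping parameters merges trivial variants, but a server that reports a misspelled type such as `text/htm` still ends up as its own entry, which is one way a count like 1549 distinct types can arise.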

13 Trends HTML: stable, 50-60%, increasing lately JPEG: increasing, 11% (1997), 27% (2005) GIF: decreasing, 23% (1997), 11% (2005) PDF: increasing, from 9th to 4th position

14 Accessing the archive First priority is to access the archive using traditional web technologies. Surf, in space and time Free text search N.B.: not using traditional library methods (cataloguing etc.)
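
Surfing "in time" means that a request for a URL at a given date has to be mapped to one of the captures actually held. A minimal sketch of that lookup, assuming the capture times for a URL are available as a sorted list (the function name and data layout are illustrative, not taken from any access tool):

```python
from bisect import bisect_left
from datetime import datetime


def closest_snapshot(capture_times, wanted):
    """Return the capture time nearest to `wanted`.

    capture_times must be a sorted, non-empty list of datetime objects.
    """
    i = bisect_left(capture_times, wanted)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = capture_times[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda t: abs(t - wanted))
```

With only a few sweeps per year, the nearest capture can be months away from the requested date, so an access tool would normally show the user which date was actually served.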

15 The archive, what we save Everything associated with an object, incl. metadata, is saved in one file Metadata from the harvesting process Metadata about the object (from the server) The object (in its original form) One unit (file) in the archive
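
The three-part unit described on this slide (harvesting metadata, server metadata, the object itself) can be illustrated with a MIME multipart container built with Python's standard `email` package. This is only a sketch of the idea; the exact layout the archive uses is not given here, and the function names and metadata strings are hypothetical.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def pack_unit(harvest_meta, server_headers, body):
    """Bundle two metadata parts and the untouched object into one file."""
    unit = MIMEMultipart("mixed")
    unit.attach(MIMEText(harvest_meta, "plain", "utf-8"))    # from the harvester
    unit.attach(MIMEText(server_headers, "plain", "utf-8"))  # from the server
    unit.attach(MIMEApplication(body))                       # the object itself
    return unit.as_bytes()


def unpack_unit(data):
    """Return (harvest_meta, server_headers, body) from a packed unit."""
    parts = message_from_bytes(data).get_payload()
    harvest_meta, server_headers = (
        p.get_payload(decode=True).decode("utf-8") for p in parts[:2]
    )
    return harvest_meta, server_headers, parts[2].get_payload(decode=True)
```

Keeping the object as an opaque byte part means it comes back out bit for bit, whatever its type.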

16 Development International Internet Preservation Consortium (IIPC) –Started by the Internet Archive and the national libraries of: Sweden, Norway, Finland, Denmark, Iceland, UK, France, Italy, Canada, Australia and USA (LoC) Now many more –Develop common standards, tools and methods for web archiving. –Raise awareness

17 Development, standards Archiving formats –Earlier formats MIME (Multipurpose Internet Mail Extensions) ARC NedLib –WARC (Web ARChive file format) File format for saving web material Each web page is one record in a WARC file A record contains metadata and content ISO 28500.
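
To make the "one record = metadata + content" idea concrete, here is a simplified sketch of how a single WARC response record is laid out: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. It is an illustration only and not a conformant writer; it leaves out warcinfo records, digests and compression, and the function name is hypothetical.

```python
import uuid
from datetime import datetime, timezone


def warc_record(target_uri, http_response, when=None):
    """Serialise one captured HTTP response as a simplified WARC record.

    http_response is the raw bytes received from the server
    (status line, headers and body).
    """
    when = when or datetime.now(timezone.utc)
    date = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + http_response + b"\r\n\r\n"  # block, then two CRLFs
```

Because every record carries its own length, records can simply be appended one after another in the same file.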

18 Development, Tools Tools –Harvesting: Heritrix Designed for archiving (NOT a modified indexer) Open source: Java, Linux etc. Supported by IIPC Mainly developed by Internet Archive with contributions Will support (now supports) WARC. Supports ARC and MIME –Surfing tools New Wayback Machine WERA – surf with timeline WAXToolbar – support when using the new Wayback Machine –NutchWAX Free text search (with timeline) –Curator tool Makes it possible for a non-technician to do collection and quality control

19 Advice Use open standards, open source IIPC Get users of the archive Think big: hundreds of terabytes, billions of files Accept that what you do is a best effort

20 Conclusion The web is constantly changing: continuous development. It is possible to get a reasonable picture of the web, but never a complete one! Do something now

21 Questions? Comments?

22 Links IIPC: Kulturarw3: Internet Archive:
