Kulturarw³: Capturing the web. The Swedish experience (www.kb.se/kw3)
Content: Background. Kulturarw³ (goals; strategy; Sweden on the net?). Harvesting (software; finding links; problems). Statistics (what have we got?). The archive (priorities; storage; what we save). Development (IIPC; tools, formats). Conclusion.
Background: Legal deposit since 1661. Latest revision 1993 (only electronic documents in fixed form: CD-ROM, diskettes). New law, 1 July 2002: exception from the personal privacy law. The first Swedish web newspaper has been lost, although printed newspapers have been preserved since 1645. Kulturarw³ started in 1996. Still waiting for a new legal deposit law.
Goals: All web pages in Sweden, including pictures, video etc.; .se and other top-level domains; electronic journals.
Strategy, two choices: (1) Select what is important. How do we know what will be considered important in the future? Labour-intensive. (2) Collect everything using automatic software. Gets everything (well, not really). Less labour-intensive.
Strategy: Take snapshots of the Swedish web a few times each year (gets everything; needs less labour; computer storage is cheap; however, large volumes make quality control difficult). Selective harvesting of about 150 newspapers every day. In the future: events, e.g. elections. With as little human intervention as possible.
Sweden on the web? Only the domain part of the address is relevant. .se; .nu (Niue), popular in Sweden because "nu" means "now" in Swedish; others if the server is geographically located in Sweden. Language?
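The domain rule above can be sketched in a few lines of Python. This is only an illustration, not the Kulturarw³ implementation: the names `SWEDISH_TLDS`, `in_scope` and `extra_hosts` are hypothetical, and the geographic-location rule is reduced to a hand-maintained set of host names.

```python
from urllib.parse import urlsplit

# Assumption: the TLD list mirrors the slide (.se and .nu). A real crawl
# would need further rules, e.g. for servers located in Sweden.
SWEDISH_TLDS = {"se", "nu"}


def in_scope(url, extra_hosts=frozenset()):
    """Return True if the URL's host falls inside the (hypothetical) Swedish scope.

    Only the domain part of the URL is inspected; path and query are ignored.
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    if not host:
        return False
    tld = host.rsplit(".", 1)[-1]
    # extra_hosts stands in for "others, if the server is located in Sweden".
    return tld in SWEDISH_TLDS or host in extra_hosts
```

For example, `in_scope("http://www.kb.se/kw3/")` is true, while a `.com` address whose path merely contains "se" is not in scope unless its host is listed in `extra_hosts`.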
Harvesting software: A harvester (crawler, spider) collects web pages by automatically following links and saving the pages. Open-source harvester: Heritrix (main developer: Internet Archive (IA); written in Java; active community; designed for archiving, not indexing). Earlier: a modified version of Combine, from NetLab, Lund University. Important: indexing isn't archiving and archiving isn't indexing! Also collects pictures, sound etc.
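The "following links" step can be illustrated with a small Python sketch. It is not Heritrix code (Heritrix is written in Java). The class `LinkExtractor` and the function `extract_links` are hypothetical names, and only `href` and `src` attributes are considered, so inline pictures are found as well as ordinary links.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from href and src attributes of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                if absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated outgoing links of one HTML page, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

A harvester would put each returned URL in a queue, fetch it, save the response, and repeat. Links produced by scripts are invisible to this kind of extraction.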
Problems (or challenges, if you are an optimist): scripts; interactive pages; password-protected pages; video/streaming material; social sites.
Statistics, what did we get? Bulk crawls (everything Swedish). First sweep, 1997, .se only: … million files, … GB of data. A sweep …, .se and other TLDs: … million files, … GB of data.
Statistics, what did we get? Periodika (newspapers). Started June …; … million URLs, 4.0 TB. About … URLs every day.
More statistics: Bulk (everything Swedish). … web servers (including inlines). Swedish: .se 50%, .nu 21%, others 29%. 1549 different MIME types found (HTML about 50%; text/html + image/gif + image/jpeg + application/pdf + text/plain about 97% of the documents; a lot of garbage, misspellings etc.).
Accessing the archive: The first priority is to access the archive using traditional web technologies. Surf in space and time. Free-text search. NB: not using traditional library methods such as cataloguing.
The archive, what we save: Everything associated with an object, including metadata, is saved in one file. Metadata from the harvesting process; metadata about the object (from the server); the object (in its original form). Together: one unit (file) in the archive.
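The three-part unit described above could be sketched as a multipart MIME message. This is an assumption made for illustration only: the slide does not say how the parts are laid out on disk, and the function name `build_archive_unit` and its parameters are hypothetical.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_archive_unit(harvest_meta, server_headers, payload):
    """Pack harvest metadata, server metadata and the object into one unit.

    harvest_meta and server_headers are text; payload is the object as bytes.
    """
    unit = MIMEMultipart("mixed")
    unit.attach(MIMEText(harvest_meta, "plain", "utf-8"))    # from the harvester
    unit.attach(MIMEText(server_headers, "plain", "utf-8"))  # from the server
    unit.attach(MIMEApplication(payload))  # the object itself, base64 in transit
    return unit.as_bytes()


def read_object(unit_bytes):
    """Return the original object bytes from a unit built above."""
    parts = message_from_bytes(unit_bytes).get_payload()
    return parts[2].get_payload(decode=True)
```

The object is base64-encoded inside the file here, but `read_object` gives back exactly the bytes that were stored, so the original form is recoverable.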
Development: International Internet Preservation Consortium (IIPC). Started by the Internet Archive and the national libraries of Sweden, Norway, Finland, Denmark, Iceland, UK, France, Italy, Canada, Australia and USA (LoC); now many more. Develops common standards, tools and methods for web archiving. Raises awareness.
Development, standards: Archiving formats. Earlier formats: multipart MIME (Multipurpose Internet Mail Extensions), ARC, NedLib. WARC (Web ARChive file format): file format for saving web material; each web page is one record in a WARC file; a record contains metadata and content; ISO.
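To make "a record contains metadata and content" concrete, here is a minimal Python sketch that writes one WARC `response` record by hand. It assumes the WARC/1.0 layout (a version line, named header fields, a blank line, the content block, then two CRLFs) and sets only a handful of fields. The function name `warc_response_record` is hypothetical, and production tools would add more fields and usually compress each record.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri, http_bytes, when=None):
    """Return one WARC response record as bytes.

    http_bytes is the full HTTP response (status line, headers, body) as
    received from the server; it becomes the content block of the record.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_bytes)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLFs after the content block.
    return head + http_bytes + b"\r\n\r\n"
```

Appending such records one after another gives a WARC file: the header fields carry the metadata, and the block carries the page as it was delivered.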
Development, tools: Harvesting: Heritrix (designed for archiving, NOT a modified indexer; open source: Java, Linux etc.; supported by IIPC; mainly developed by Internet Archive with contributions from others; WARC support is under way; supports ARC and MIME). Surfing tools: new Wayback Machine; WERA, surf with a time line; WAXToolbar, support when using the new Wayback Machine. NutchWax: free-text search (with time line). Curator tool: makes it possible for a non-technician to do collection and quality control.
Advice: Use open standards and open source. IIPC. Get users of the archive. Think big: hundreds of terabytes, billions of files. Accept that what you do is a best effort.
Conclusion: The web is constantly changing and under continuous development. It is possible to get a reasonable picture of the web, but never a complete one! Do something now.