Presentation on theme: "Kulturarw³: Capturing the web. The Swedish experience (www.kb.se/kw3)". Presentation transcript:
www.kb.se Kulturarw³ Capturing the web The Swedish experience www.kb.se/kw3
Content
- Background
- Kulturarw³
  - Goals
  - Strategy
  - Sweden on the net?
- Harvesting
  - Software
  - Finding links
  - Problems
- Statistics
  - What have we got?
- The archive
  - Priorities
  - Storage
  - What we save
- Development
  - IIPC
  - Tools, format
- Conclusion
Background
- Legal deposit since 1661
- Latest revision 1993
  - Only electronic documents in fixed form
  - CD-ROM, diskettes
- New law
  - 1 July 2002: exception from the personal privacy law
- First Swedish web newspaper lost
  - Printed newspapers since 1645
- Kulturarw³ started 1996
- Still waiting for a new legal deposit law
Goals
- All web pages in Sweden
  - Pictures, video etc.
  - .se and other top-level domains
  - Electronic journals
Strategy: two choices
- Select what is important
  - How can we know what will be considered important in the future?
  - Labour-intensive
- Everything, using automatic software
  - Gets everything (well, not really)
  - Less labour-intensive
Strategy
- Take snapshots of the Swedish web a few times each year
  - Gets all
  - Needs less labour
  - Computer memory is cheap
  - However, large volumes make quality control difficult
- Selective harvesting of about 150 newspapers every day
- In the future: events, e.g. elections
- With as little human intervention as possible
Sweden on the web?
- http://www.kb.se/kbstart.htm: only the domain part is relevant
- .se
- .nu (Niue), popular in Sweden: "nu" means "now" in Swedish
- Others, if the server is geographically located in Sweden
- Language?
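The scoping rule on this slide can be sketched in a few lines of Python. This is only an illustration, not the Kulturarw³ implementation: the function name `in_swedish_scope` and the `server_in_sweden` flag are hypothetical, and the geographic lookup of the server is assumed to happen somewhere else.

```python
from urllib.parse import urlsplit

# Top-level domains the slide treats as Swedish by default.
SWEDISH_TLDS = {"se", "nu"}

def in_swedish_scope(url, server_in_sweden=False):
    """Decide scope from the domain part of the URL only.

    server_in_sweden is a hypothetical flag standing in for a
    geographic lookup of the server, which is not shown here.
    """
    host = urlsplit(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return tld in SWEDISH_TLDS or server_in_sweden

print(in_swedish_scope("http://www.kb.se/kbstart.htm"))
```

The path (`/kbstart.htm`) plays no part in the decision, which matches the note that only the domain part is relevant. The open "Language?" question on the slide is not covered by this sketch.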
Harvesting software
- A harvester (crawler, spider) collects web pages by automatically following links and saving pages
- Open-source harvester: Heritrix
  - Main developer: Internet Archive (IA)
  - Written in Java. Active community.
  - Designed for archiving, not indexing.
- Earlier: modified version of Combine
  - From NetLab, Lund University
- Important! Indexing isn't archiving and archiving isn't indexing!
- Also collects pictures, sound etc.
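The "following links" step can be illustrated with Python's standard library. This is a minimal sketch of link finding in general, not how Heritrix or Combine does it; the names `LinkCollector` and `extract_links` are made up for the example. It looks at `href` and `src` attributes so that inline pictures and sound are found as well as ordinary links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute URLs from href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links

page = '<a href="/kw3">Kulturarw3</a> <img src="logo.gif">'
print(extract_links("http://www.kb.se/kbstart.htm", page))
```

A real harvester would then fetch each new URL, save it, and repeat. Links that only appear when a script runs are not found this way.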
Problems
…or challenges if you are an optimist…
- Scripts
- Interactive pages
- Password-protected pages
- Video/streaming material
- Social sites
Statistics – what did we get?
Bulk crawls (everything Swedish)
- First sweep, 1997, only .se
  - 6.8 million files
  - 160 GB of data
- A sweep in 2007-2008, .se and other TLDs
  - 270 million files
  - 11 500 GB of data
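As a rough reading aid, the two sweeps can be compared by average file size. The sketch below uses only the figures on the slide and assumes that 1 GB means 10**9 bytes, which the slide does not state.

```python
# Figures from the slide; 1 GB is assumed to mean 10**9 bytes.
sweeps = {
    "1997": (6.8e6, 160e9),
    "2007-2008": (270e6, 11500e9),
}

def average_file_size_kb(files, total_bytes):
    """Average size per file in kilobytes (1 kB = 1000 bytes)."""
    return total_bytes / files / 1000

for name, (files, total_bytes) in sweeps.items():
    print(name, round(average_file_size_kb(files, total_bytes), 1), "kB per file")
```

Under that assumption the average comes out at roughly 23.5 kB per file for the 1997 sweep and roughly 42.6 kB per file for the 2007-2008 sweep.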
Statistics – what did we get?
Periodika (newspapers)
- Started June 2002
- 88 million URLs
- 4.0 TB
- About 40 000 URLs every day
More statistics
Bulk (everything Swedish)
- 823 100 web servers (including inlines)
- 651 700 Swedish
  - .se 50%
  - .nu 21%
  - others 29%
- 1549 different MIME types found
  - HTML about 50%
  - text/html + image/gif + image/jpeg + application/pdf + text/plain: about 97% of the documents
  - A lot of garbage, misspellings etc.
Accessing the archive
- First priority is to access the archive using traditional web technologies
  - Surf, in space and time
  - Free-text search
- NB: not using traditional library methods (cataloguing etc.)
The archive, what we save
Everything associated with an object, including metadata, is saved in one file.
One unit (file) in the archive:
- Metadata from the harvesting process
- Metadata about the object (from the server)
- The object (in its original form)
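The idea of one file holding three parts can be sketched with Python's standard `email` package, which writes multipart MIME containers. This is an illustration only: the exact layout of the files in the archive is not given on the slide, and `build_archive_unit` and its parameters are hypothetical names.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_archive_unit(harvest_meta, server_headers, body):
    """Pack harvest metadata, server metadata and the object into one unit."""
    unit = MIMEMultipart("mixed")
    unit.attach(MIMEText(harvest_meta, "plain", "utf-8"))    # from the harvester
    unit.attach(MIMEText(server_headers, "plain", "utf-8"))  # from the server
    unit.attach(MIMEApplication(body))  # the object; base64-encoded in this sketch
    return unit.as_bytes()

raw = build_archive_unit(
    "harvested-by: example-crawler",   # made-up harvest metadata
    "Content-Type: text/html",         # headers as returned by the server
    b"<html><body>Hej</body></html>",
)
parts = message_from_bytes(raw).get_payload()
print(len(parts), parts[2].get_payload(decode=True))
```

Decoding the third part returns the object's bytes unchanged, so the original form is recoverable from the single file.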
Development
- International Internet Preservation Consortium (IIPC)
  - Started by the Internet Archive and the national libraries of Sweden, Norway, Finland, Denmark, Iceland, the UK, France, Italy, Canada, Australia and the USA (LoC)
  - Now many more
  - Develop common standards, tools and methods for web archiving
  - Raise awareness
Development, standards
- Archiving formats
  - Earlier formats
    - MIME (Multipurpose Internet Mail Extensions, multipart messages)
    - ARC
    - NedLib
  - WARC (Web ARChive file format)
    - File format for saving web material
    - Each web page is one record in a WARC file
    - A record contains metadata and content
    - ISO 28500
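To make "a record contains metadata and content" concrete, here is a simplified sketch of one WARC response record written by hand. It follows the general WARC 1.0 layout (a version line, named header fields, a blank line, the content block, then two line breaks) but is not validated against ISO 28500; tools such as Heritrix write more fields than this. The function name `make_warc_record` is made up for the example.

```python
import uuid
from datetime import datetime, timezone

def make_warc_record(target_uri, http_response):
    """Build one simplified WARC response record as bytes."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_response)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The block is the HTTP response as received, followed by two CRLFs.
    return head + http_response + b"\r\n\r\n"

response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>Hej</html>"
print(make_warc_record("http://www.kb.se/kbstart.htm", response).decode("utf-8"))
```

A WARC file is then a sequence of such records, one after the other, so one file can hold many web pages.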
Development, tools
- Tools
  - Harvesting: Heritrix
    - Designed for archiving (NOT a modified indexer)
    - Open source: Java, Linux etc.
    - Supported by IIPC
    - Mainly developed by Internet Archive, with contributions
    - Will support (now supports) WARC. Supports ARC and MIME
  - Surfing tools
    - New Wayback Machine
    - WERA: surf with a time line
    - WAXToolbar: support when using the new Wayback Machine
  - NutchWAX
    - Free-text search (with time line)
  - Curator tool
    - Possible for a non-technician to do collection and quality control
Advice
- Use open standards, open source, IIPC
- Get users of the archive
- Think big: hundreds of terabytes, billions of files
- Accept that what you do is a best effort
Conclusion
- The web is constantly changing: continuous development
- Possible to get a reasonable picture of the web. But never complete!
- Do something now