Panel: What Changes With Digital? Web Archiving ARL Forum 2009 Tracy Seneca – California Digital Library
Ground to cover: Brief background: Web Archiving Service What changes about collecting 2 Case studies in collaboration Across institutions Between faculty, librarians Across disparate archiving systems Between libraries, content owners
Web Archiving Service Developed by the University of California Curation Center of the California Digital Library – Formerly the Digital Preservation Group Outcome of the Web-at-Risk grant – 1st round of NDIIPP grant work UC campus libraries, NYU, Stanford, University of North Texas
The Web Archiving Service
What Changes About Collecting? 1. The target of collection becomes debatable: – An archive? – A site? – A document?
What Changes About Collecting? 2. There's a lot we don't know about what we're collecting – How big is it? How much storage will it use? – What's in it? – What are the new publications on the site? – Is the site linking to valuable, relevant information?
Mining Sites for Documents
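One way to approach the question "What are the new publications on the site?" is to compare the URLs found in two successive captures and keep only the document-like files that appear for the first time. The sketch below is illustrative only and does not describe how the Web Archiving Service works: the function names, the list of file extensions, and the example.org URLs are all assumptions.

```python
from urllib.parse import urlparse

# Assumed set of extensions that count as "documents"; adjust to the collection.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".ppt", ".xls")


def document_urls(urls):
    """Return the subset of URLs whose path ends in a document extension."""
    return {
        url
        for url in urls
        if urlparse(url).path.lower().endswith(DOCUMENT_EXTENSIONS)
    }


def new_documents(previous_capture, current_capture):
    """List documents present in the current capture but not the previous one."""
    return sorted(document_urls(current_capture) - document_urls(previous_capture))


# Hypothetical URL lists from two captures of the same site.
spring_capture = [
    "http://example.org/index.html",
    "http://example.org/reports/2008-annual.pdf",
]
fall_capture = [
    "http://example.org/index.html",
    "http://example.org/reports/2008-annual.pdf",
    "http://example.org/reports/2009-annual.pdf",
    "http://example.org/news.html",
]

print(new_documents(spring_capture, fall_capture))
```

Comparing by URL alone misses documents that were replaced in place under the same address; a fuller approach would also compare content checksums.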
What Changes About Collecting? 3. It's not always clear what to collect – A national library may have a clear mandate to capture its nation's web domain – The Institute of Transportation Studies may have an immediately obvious scope of content to collect – What does a large research library collect?
What Changes About Collecting? 4. We don't know how scholars will use this information. Object of study could be: – Content of the documents – Site change – Acts of citizen journalism – Blog spam, viruses
There is an ongoing need for case studies that can illustrate possible approaches to early interventions with digital records creators, institutional collaborations, and partnerships with information technology specialists.
Case 1: 2003 Recall
2003 California Recall Archive 200+ sites selected by UC Librarians, Stanford Sites crawled by Stanford Computer Science Dept. as part of WebBase project Content captured in entirely different format from the WARC archival format used by WAS & Archive-It Content migrated to WARC format, transferred to CDL in 2008 – public access via WAS in July 2009
Case 1 Collaborative content selection across campuses, institutions Data stewardship across institutions Migration of data across formats, archive data models Collaboration between Social Science faculty, Computer Science grad students A dark archive goes light!
Case 2: California Government
Collaborative Collection State of California Government Information librarians across UC campuses manage the archive. 300 sites derived from California State Agency Directory Source for shared cataloging of key California State Documents Twice yearly captures of all agency sites; more frequent captures of approximately 30 priority sites
Collaborative Collection: Local California Seven archives of local California agencies maintained by separate UC campuses Testing cross-archival search tools to combine all state, local search results 518 sites preserved in local archives Challenge: varying resources, priorities at UC campuses, some geographic areas missed Cornell Web Lab study identifies .ca.gov as the third largest U.S. government subdomain.
Need for Collaboration with Content Owners
Robots.txt Patterns in California State Agency Sites Restricted: California State Library; California State Controller; Office of State Publishing; Secretary of State Not Restricted: Office of Information Security and Privacy Protection; Office of Systems Integration; Legislative Analyst's Office
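To make "restricted" concrete: a site's robots.txt file tells crawlers which paths they may fetch, and a blanket `Disallow: /` rule shuts out a polite archiving crawler entirely. The sketch below uses Python's standard `urllib.robotparser` module to test a robots.txt body. The helper name `is_restricted` and the two sample files are assumptions for illustration, not the actual robots.txt contents of the agencies listed above.

```python
from urllib.robotparser import RobotFileParser


def is_restricted(robots_txt, user_agent="*", path="/"):
    """Return True if the given robots.txt text forbids user_agent from fetching path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


# Hypothetical robots.txt bodies, not taken from any real agency site.
blocks_all_crawlers = "User-agent: *\nDisallow: /\n"
allows_all_crawlers = "User-agent: *\nDisallow:\n"

print(is_restricted(blocks_all_crawlers))
print(is_restricted(allows_all_crawlers))
```

To check a live site instead of a text string, `RobotFileParser` also offers `set_url()` and `read()` to fetch the file over the network before calling `can_fetch()`.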
Is Robots.txt Really a Copyright Management Tool? The conversation used to be between the library and the publisher. Now, it is between the library and a webmaster. - Gildas Illien, Bibliothèque nationale de France
A Potentially Fruitful Conversation? The National Archives comprehensively archive UK Central Government sites Continuity and Preservation: The National Archives approach to maintaining permanent access to the web presence of UK Central Government - Amanda Spencer and Alison Heatherington
Case 2 Collaboration across campuses in selection, resource allocation Shared collection of material relevant to all campuses Communication underway with state agencies, webmasters Potential to provide service directly to agency site Potential to begin linking archives together
Parting thoughts MANY more examples of collaborative work! – End of Term Harvest – International Internet Preservation Consortium – international Olympics archive – Zepheira: data visualization portal to NDIIPP content …longer-term preservation costs for these kinds of materials are not well understood. In the digital world, it is all too easy to acquire materials that a library cannot afford to keep in perpetuity.