Panel: What Changes With Digital? Web Archiving ARL Forum 2009 Tracy Seneca – California Digital Library.

1 Panel: What Changes With Digital? Web Archiving ARL Forum 2009 Tracy Seneca – California Digital Library

2 Ground to cover: Brief background: Web Archiving Service What changes about collecting 2 Case studies in collaboration Across institutions Between faculty, librarians Across disparate archiving systems Between libraries, content owners

3 Web Archiving Service: Developed by the University of California Curation Center of the California Digital Library – formerly the Digital Preservation Group. Outcome of the Web-at-Risk grant – 1st round of NDIIPP grant work. UC campus libraries, NYU, Stanford, University of North Texas

4 The Web Archiving Service


6 What Changes About Collecting? 1. The target of collection becomes debatable: – An archive? – A site? – A document?

7 What Changes About Collecting? 2. There's a lot we don't know about what we're collecting – How big is it? How much storage will it use? – What's in it? – What are the new publications on the site? – Is the site linking to valuable, relevant information?

8 Mining Sites for Documents

9 What Changes About Collecting? 3. It's not always clear what to collect – A national library may have a clear mandate to capture the nation's web domain – The Institute of Transportation Studies may have an immediately obvious scope of content to collect – What does a large research library collect?

10 What Changes About Collecting? 4. We don't know how scholars will use this information. Object of study could be: – Content of the documents – Site change – Acts of citizen journalism – Blog spam, viruses

11 There is an ongoing need for case studies that can illustrate possible approaches to early interventions with digital records creators, institutional collaborations, and partnerships with information technology specialists.

12 Case 1: 2003 Recall

13 2003 California Recall Archive: 200+ sites selected by UC Librarians, Stanford. Sites crawled by Stanford Computer Science Dept. as part of WebBase project. Content captured in entirely different format from the WARC archival format used by WAS & Archive-It. Content migrated to WARC format, transferred to CDL in 2008 – public access via WAS in July 2009


15 Case 1 Collaborative content selection across campuses, institutions Data stewardship across institutions Migration of data across formats, archive data models Collaboration between Social Science faculty, Computer Science grad students A dark archive goes light!

16 Case 2: California Government

17 Collaborative Collection State of California Government Information librarians across UC campuses manage the archive. 300 sites derived from California State Agency Directory Source for shared cataloging of key California State Documents Twice yearly captures of all agency sites; more frequent captures of approximately 30 priority sites

18 Collaborative Collection: Local California Seven archives of local California agencies maintained by separate UC campuses Testing cross-archival search tools to combine all state, local search results 518 sites preserved in local archives Challenge: varying resources, priorities at UC campuses, some geographic areas missed Cornell Web Lab study as the third largest U.S. government subdomain.

19 Need for Collaboration with Content Owners

20 Robots.txt Patterns in California State Agency Sites Restricted: California State Library California State Controller Office of State Publishing Secretary of State Not Restricted: Office of Information Security and Privacy Protection Office of Systems Integration Legislative Analyst's Office

21 Consistent design, strong patterns to restrictions. 28 sites read exactly:

User-agent: *
Disallow: /images
Disallow: /classes
Disallow: /cgi-bin
Disallow: /htdig
Disallow: /js
Disallow: /styles
Disallow: /ssi
Disallow: /css
Disallow: /javascript
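
To illustrate what the robots.txt pattern on this slide means for a crawler that honors it, here is a minimal sketch using Python's standard urllib.robotparser. The user-agent name and the sample paths are hypothetical, chosen only for illustration; the slides do not say which crawler or URLs were involved.

```python
from urllib.robotparser import RobotFileParser

# The rule set quoted on the slide (reported as identical on 28 sites).
ROBOTS_TXT = """\
User-agent: *
Disallow: /images
Disallow: /classes
Disallow: /cgi-bin
Disallow: /htdig
Disallow: /js
Disallow: /styles
Disallow: /ssi
Disallow: /css
Disallow: /javascript
"""


def blocked_paths(robots_txt, paths, user_agent="example-archive-crawler"):
    """Return the subset of paths a robots.txt-respecting crawler must skip.

    user_agent is a made-up name; the rules above apply to every agent ("*").
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Hypothetical paths on an agency site.
sample_paths = [
    "/index.html",
    "/images/logo.gif",
    "/css/site.css",
    "/reports/2009.pdf",
]
print(blocked_paths(ROBOTS_TXT, sample_paths))
```

Under these rules the HTML page and the PDF remain fetchable, while the image and stylesheet are excluded, so a capture that follows robots.txt would likely be missing the site's graphics and styling.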

22 Is Robots.txt Really a Copyright Management Tool? The conversation used to be between the library and the publisher. Now, it is between the library and a webmaster. - Gildas Illien, Bibliothèque nationale de France

23 A Potentially Fruitful Conversation? The National Archives comprehensively archive UK Central Government sites Continuity and Preservation: The National Archives approach to maintaining permanent access to the web presence of UK Central Government - Amanda Spencer and Alison Heatherington

24 Case 2 Collaboration across campuses in selection, resource allocation Shared collection of material relevant to all campuses Communication underway with state agencies, webmasters Potential to provide service directly to agency site Potential to begin linking archives together

25 Parting thoughts MANY more examples of collaborative work! – End of Term Harvest – International Internet Preservation Consortium – international Olympics archive – Zepheira: data visualization portal to NDIIPP content …longer-term preservation costs for these kinds of materials are not well understood. In the digital world, it is all too easy to acquire materials that a library cannot afford to keep in perpetuity.
