Metadata Extraction and Web Archives: Automating the Record Creation Process Tracy Meehleib Library of Congress, NDMSO NDIIPP June 25, 2009.



Library of Congress Web Archives
EVENT-DRIVEN
- September 11th, 2001
- Winter Olympic Games 2002
- U.S. Congresses: 107th, 108th, 109th, etc.
- U.S. Elections: 2000, 2002, 2004, 2006, 2008, etc.
- Iraq War
- Papal Transition 2005
- Supreme Court Nominations
- Crisis in Darfur, Sudan 2006
- Egypt 2008
FORMAT/COLLECTION-DRIVEN
- Organizational sites corresponding to papers/archives collected by LC’s Manuscript Division
- Sites corresponding to creators whose works are collected by or represented in LC’s P&P Division
- Legal blawgs identified by the Law Division

Iraq War, 2003 Web Archive

Crisis in Darfur, Sudan 2006 Web Archive

LC Manuscript Division Archive of Organizational Web Sites

Visual Image Web Archive

Legal Blawgs Web Archive

Egypt, 2008 Web Archive

Library of Congress Web Archives [table of archives and site counts; only some counts are legible]
- Election archives (five listed)
- 107th Congress
- 108th Congress
- 109th Congress: 580*
- 110th Congress: 580*
- September 11
- Winter Olympics 2002
- Iraq War 2003
- Papal Transition 2005
- Crisis in Darfur, Sudan 2006
- Visual Images: 17
- Organizational Sites, Manuscript Division: 30
- U.S. Supreme Court Nominations
- Legal Blawgs
- Egypt

Web Archives Processing Workflow
1. Metadata extraction results in a preliminary MODS record for each archived site.
2. Review and enhance the record, revising some values if needed (title, language, abstract, keywords) and adding others (LCSH headings: subjects and sometimes names).
3. Register item-level handles.
4. Load the MODS records onto the server, index them, and generate item-level search/browse.
5. Create a collection-level record in the ILS and register a collection-level handle.

Why Provide Site-Level Access to these Sites?
- Searching W/ARC files by keyword and URL alone limits access
- Controlled vocabularies (LCSH, TGM, etc.) increase access
- Leverage subject cataloging and language expertise to enhance subject access as economically as possible
- Resources can be integrated with other library resources at the item level
- Better precision and recall when searching within and across archives
- Persistent IDs/handles allow for stable citations and digital scholarship at the site level
- Leverage existing search/browse systems

How Do We Provide Site-Level Access to these Sites?
- Boilerplate as much relevant archive-level and site-level metadata as possible into the MODS template
- Extract as much useful metadata as possible from the archived web sites’ W/ARC files (using a Perl script or other method that grabs the metadata from meta tags in the W/ARC files): titles, dates, file types, abstracts, subject keywords, etc.
- Leverage LC subject cataloging and language expertise and controlled vocabularies to add subject access
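The extraction step above can be sketched in a few lines. This is a minimal illustration in Python rather than the Perl script the slide mentions, and it assumes the HTML of a page has already been read out of the W/ARC file as a string; the names `MetaTagExtractor` and `extract_metadata` are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Meta names are matched case-insensitively ("Description", "KEYWORDS").
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html_text):
    """Return title, abstract, and keyword list for one archived page."""
    parser = MetaTagExtractor()
    parser.feed(html_text)
    parser.close()
    raw_keywords = parser.meta.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    return {
        "title": parser.title.strip(),
        "abstract": parser.meta.get("description", ""),
        "keywords": keywords,
    }
```

Sites without these meta tags simply yield empty values, which is where the cataloger review step takes over.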

Overview of MODS Record Data Elements
- Title: extracted from W/ARC file (HTML title tag); cataloger uses it if viable, otherwise supplies a title
- Alternative Title: cataloger supplies if another useful and different title displays on the piece
- Name, Personal: included for some archives; cataloger supplies when relevant
- Name, Corporate: included for some archives; cataloger supplies when relevant
- Type of Resource: boilerplate “text”
- Genre: boilerplate “Web site”
- Origin Info: extracted from W/ARC file; first/last dates captured, YYYYMMDD (ISO 8601)
- Language: boilerplate if known (ISO 639-2/B code); cataloger can supply additional languages
- Physical Description: extracted from W/ARC file (MIME type), e.g., text/css, image/jpeg
- Abstract: extracted from W/ARC file (META name=description content); cataloger can edit/enhance
- Subject/Keywords: extracted from W/ARC file (META name=keywords content); cataloger can edit/enhance
- Subject/LCSH: cataloger supplies
- Collection Title/PID: boilerplate collection title and collection PID/handle
- Identifier: boilerplate, variant of handle, e.g., hdl:loc.natlib/mrva
- Note: extracted from W/ARC file; resolves to URL for active site
- Location/Usage: boilerplate item-level PID/handle; PID is registered to resolve to the archived Web site URL
- Access Condition: boilerplate rights/permissions info, imported from OSI records
- Record Info: boilerplate record creation date; boilerplate record identifier, handle suffix mrva
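As an illustration of how boilerplate and extracted values might be merged into a preliminary record, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It covers only some of the elements listed above. The function name, the dictionary keys, and the two capture dates are assumptions made for the example; the other sample values are taken from the afrika.no record shown later in the slides.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def _q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def build_preliminary_mods(extracted, boilerplate):
    """Merge extracted and boilerplate values into one MODS element tree."""
    mods = ET.Element(_q("mods"))
    title_info = ET.SubElement(mods, _q("titleInfo"))
    ET.SubElement(title_info, _q("title")).text = extracted["title"]
    ET.SubElement(mods, _q("typeOfResource")).text = boilerplate["type_of_resource"]
    ET.SubElement(mods, _q("genre")).text = boilerplate["genre"]

    # First and last capture dates, encoded as ISO 8601 (YYYYMMDD).
    origin = ET.SubElement(mods, _q("originInfo"))
    for point, key in (("start", "first_capture"), ("end", "last_capture")):
        date = ET.SubElement(origin, _q("dateCaptured"),
                             {"encoding": "iso8601", "point": point})
        date.text = extracted[key]

    for code in boilerplate["languages"]:
        language = ET.SubElement(mods, _q("language"))
        term = ET.SubElement(language, _q("languageTerm"),
                             {"type": "code", "authority": "iso639-2b"})
        term.text = code

    physical = ET.SubElement(mods, _q("physicalDescription"))
    for mime_type in extracted["mime_types"]:
        ET.SubElement(physical, _q("internetMediaType")).text = mime_type

    ET.SubElement(mods, _q("abstract")).text = extracted["abstract"]
    for keyword in extracted["keywords"]:
        subject = ET.SubElement(mods, _q("subject"))
        ET.SubElement(subject, _q("topic")).text = keyword
    ET.SubElement(mods, _q("accessCondition")).text = boilerplate["access_condition"]
    return mods


extracted = {
    "title": "afrika.no: The Norwegian Council for Africa",
    "first_capture": "20060101",  # placeholder date, not from the slides
    "last_capture": "20061231",   # placeholder date, not from the slides
    "mime_types": ["text/html", "text/css", "image/jpeg"],
    "abstract": "afrika.no - The Index on Africa and Africa News Update.",
    "keywords": ["afrika", "africa", "culture"],
}
boilerplate = {
    "type_of_resource": "text",
    "genre": "Web site",
    "languages": ["eng", "nor"],
    "access_condition": "Access restricted to on-site users at the Library of Congress",
}
record = build_preliminary_mods(extracted, boilerplate)
print(ET.tostring(record, encoding="unicode"))
```

The cataloger-supplied elements (LCSH subjects, names, alternative titles) are deliberately left out, since the slides describe them as added during review rather than at extraction time.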

Crisis in Darfur, Sudan 2006 Web Archive
- Archive size: 218 sites
- Harvest info: 1 phase, multiple captures
- Frequency: varies; weekly to monthly crawls for each site
- Metadata: 1 collection-level MARC record, with collection-level PID; 218 item-level MODS records, with item-level PIDs
- LCSH: 1 boilerplate LCSH heading, plus unlimited specific LCSH headings at site level, selected by the cataloger from a list of about 20 LCSH terms that relate to the content in the archive
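The LCSH arrangement described above (one boilerplate heading on every record, plus site-level headings drawn from a short archive-specific list) lends itself to a simple consistency check. The sketch below is an assumption about how such a check could look, not a description of LC's tooling; `check_headings` and its arguments are hypothetical names.

```python
def check_headings(assigned, approved_list, boilerplate_heading):
    """Split a cataloger's headings into accepted and rejected.

    The boilerplate heading is always included first. Site-level headings
    are accepted only if they appear on the archive's approved list;
    matching ignores case and surrounding whitespace.
    """
    approved = {heading.casefold(): heading for heading in approved_list}
    accepted = [boilerplate_heading]
    rejected = []
    for heading in assigned:
        match = approved.get(heading.strip().casefold())
        if match is None:
            rejected.append(heading)
        elif match not in accepted:
            # Store the list's own spelling and skip duplicates.
            accepted.append(match)
    return accepted, rejected
```

Rejected headings would go back to the cataloger for correction rather than being dropped silently.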

Catalogers’ List for Darfur, 2006 Web Archive

Resource Page for an Archived Web Site, Darfur, 2006 Web Archive

Bilingual (eng/nor) Archived Web Site - Darfur, 2006 Web Archive

Preliminary MODS Record – Darfur, 2006 Web Archive

MODS Subject Heading List - Darfur, 2006 Web Archive

Completed MODS Record – Darfur, 2006 Web Archive
- Title: afrika.no: The Norwegian Council for Africa
- Type of Resource: text
- Genre: Web site
- Language: eng, nor
- Physical Description: application/download, application/x-javascript, image/bmp, image/gif, image/jpeg, image/pjpeg, text/css, text/html
- Abstract: afrika.no - The Index on Africa and Africa News Update. Features news on and links to all countries in Africa. With sections on Culture, Development, Economy, Education, Environment, Health, Human Rights, News and Politics. By the Norwegian Council for Africa.
- Subject/Keywords: afrika, africa, culture, development, economy, education, environment, health, politics, travel
- Subject/LCSH: Sudan History Darfur Conflict, International relief Sudan Economic conditions
- Collection Title/PID: Crisis in Darfur, Sudan Web Archive, hdl:loc.natlib/mrva
- Access Condition: Access restricted to on-site users at the Library of Congress
- Record Info: mrva

Displayed MODS Record - Darfur, 2006 Web Archive

Library of Congress Web Archives Homepage

Collection Overview - Darfur, 2006 Web Archive

Search Page - Darfur, 2006 Web Archive

Browse Page - Darfur, 2006 Web Archive

MARC Collection-Level Record - Darfur, 2006 Web Archive

Google Search – Item in Darfur, 2006 Web Archive

LC Web Archives – Levels of Access [diagram]
- ILS OPAC: MARC collection-level record
- Lucene search interface: archive-level homepage and MODS records search/browse for 107th Congress, 108th Congress, Election 2002, Election 2004, September 11, 2001, Olympics 2002, Iraq War 2003, Papal Transition 2005, Crisis in Darfur 2006, Egypt 2008, Legal Blawgs
- NutchWAX: indexes W/ARC files
- Archived web sites
- MODS item-level records
- Internet search engines

Results - Pros
- Archived resources are searchable and indexable along with other library collections and online resources
- Item-level and collection-level subject access and controlled vocabularies make these resources easy to integrate at both the item level and the collection level
- Site-level access facilitates searching and browsing within and across web archives: the ability to find, refind, and cite resources
- Good use and reuse of extracted and human-created metadata; a friendly environment in which traditional catalogers learn XML and MODS; the project benefits from specialized subject cataloger expertise
- Flexible and sustainable infrastructure for making web archives available for digital scholarship, with stable, citable persistent IDs at the site level and the collection level

Results - Cons
- Scalability: the approach works well with archives of up to 2,000 sites, but hasn’t been tested with much larger archives
- Project investment is basically the same for each archive, whether it has 100 sites or 2,000: project setup still requires template creation, metadata extraction, LCSH analysis at the archive level, handle registration, etc.

Future Considerations
- MODS tools: need for a flexible MODS input/editing form that hides boilerplate and extracted metadata the cataloger does not need to see. We have experimented with XMLSpy’s Authentic and with XForms, but with both we lose flexibility with regard to parsed subjects
- Future plans to integrate the NutchWAX component to provide more comprehensive keyword access to W/ARC files; this will complement existing collection-level and site-level access
- Experiment with tag cloud generators to increase subject keyword access
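The slides do not say which tag cloud generator was tried. As a rough sketch of the underlying idea, the Python below ranks the most frequent terms in a site's text so they can be sized in a cloud or offered as candidate keywords. The function name, the stopword list, and the default thresholds are all assumptions made for the example.

```python
import re
from collections import Counter

# A deliberately tiny English stopword list; a real one would be longer
# and would need entries for each language in the archive.
STOPWORDS = frozenset({
    "the", "and", "for", "with", "from", "that", "this", "are", "was", "all",
})


def tag_cloud_terms(text, top_n=10, min_length=3):
    """Return the top_n (term, count) pairs for a tag cloud."""
    # Runs of letters only, so accented characters survive while digits
    # and punctuation are dropped.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(
        word for word in words
        if len(word) >= min_length and word not in STOPWORDS
    )
    return counts.most_common(top_n)
```

Raw frequency is the simplest possible weighting; it tends to surface site navigation words, which is one reason extracted terms would still need cataloger review.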

Tag Cloud Generated from Archived Web Site Darfur, 2006 Web Archive

THAT’S ALL FOLKS