1 FEDLINK OCLC Users Group Meeting
Metadata Standards
Eric Childress, OCLC
Comments, corrections, suggestions are welcome. Please contact the author via
Washington, DC
November 18, 2003
2 Overview
- Fundamentals:
  - Types of metadata
  - Document mark-up languages & character encodings
- MetaMap
- Metadata formats:
  - MARC, MODS
  - DC, ONIX
  - TEI, EAD, METS, MIX
  - RDF, FGDC, COSATI
3 Fundamentals: 5 types of metadata
- Descriptive: Title, author, summary, topic, etc.
- Technical & Structural: File size, software needed, file type(s), presentation instructions, etc.
- Administrative (a.k.a. “meta-metadata”): Record number, record date, record source, etc.
- Rights: Copyright ownership, use privileges, etc.
- Management [typically by/for owning agency]: Price paid, circulation restrictions, etc.
4 Fundamentals: Markup languages
- Markup languages:
  - Address the structure of a document
  - Convey instructions to software that will process text to:
    - Index the text for searching
    - Render the text (e.g., for screen display or print)
    - Transform the text (e.g., for a voice synthesizer) for some output device(s)
  - The markup is generally invisible to end-users
- Extensible Markup Language (XML):
  - XML is a metalanguage
  - Agencies define their own XML to suit their task
    - By creating Document Type Definitions (DTDs) or XML schema
  - Data is separate from presentation instructions
    - Presentation instructions go in a style sheet
  - Offers just the right mix of flexibility and structure
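A minimal sketch of the "data is separate from presentation" point, assuming Python's standard `xml.etree.ElementTree` module. The tag names (`report`, `title`, `author`, `summary`) and the function names are invented for illustration, and the two render functions stand in for what a style sheet would normally do.

```python
import xml.etree.ElementTree as ET

# Hypothetical agency-defined markup: these tag names are invented for this sketch.
REPORT_XML = """<report>
  <title>Metadata Standards</title>
  <author>Eric Childress</author>
  <summary>An overview of selected metadata formats.</summary>
</report>"""


def index_terms(xml_text):
    """Index the text for searching: map each child tag to its text content."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}


def render_for_screen(xml_text):
    """One presentation of the data: upper-case title above a byline."""
    root = ET.fromstring(xml_text)
    return f"{root.findtext('title').upper()}\nby {root.findtext('author')}"


def render_for_speech(xml_text):
    """A second presentation of the same data, phrased as a sentence for a voice synthesizer."""
    root = ET.fromstring(xml_text)
    return (
        f"{root.findtext('title')}, by {root.findtext('author')}. "
        f"{root.findtext('summary')}"
    )


print(index_terms(REPORT_XML))
print(render_for_screen(REPORT_XML))
print(render_for_speech(REPORT_XML))
```

The same marked-up source feeds all three functions unchanged; only the processing instructions differ.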
5 Fundamentals: Character encodings
- Character encoding:
  - Used for communicating text characters in a computing environment
  - Hundreds of character encoding standards exist
  - Character conversion is complex and expensive
- Unicode:
  - A single, “comprehensive” global encoding standard
  - Includes characters from scripts of all major modern, most minor, and selected ancient languages
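A short sketch of why character conversion needs care, using Python's built-in codecs. The sample strings are arbitrary choices for illustration: the same character is stored as different bytes under different encodings, and reading bytes with the wrong encoding silently garbles the text.

```python
# The same character, U+00E9 (e with acute accent), under three encodings.
text = "\u00e9"

latin1_bytes = text.encode("latin-1")    # one byte: 0xE9
utf8_bytes = text.encode("utf-8")        # two bytes: 0xC3 0xA9
utf16_bytes = text.encode("utf-16-be")   # two bytes: 0x00 0xE9

# Decoding UTF-8 bytes as Latin-1 does not fail; it quietly yields two wrong characters.
misread = utf8_bytes.decode("latin-1")

# Decoding with the encoding that was used to write the bytes restores the text.
roundtrip = utf8_bytes.decode("utf-8")

# A legacy single-script encoding cannot represent other scripts at all.
japanese = "\u65e5\u672c\u8a9e"
try:
    japanese.encode("latin-1")
    latin1_can_encode_japanese = True
except UnicodeEncodeError:
    latin1_can_encode_japanese = False

# Unicode (here in its UTF-8 form) covers both scripts in one standard.
mixed = (text + japanese).encode("utf-8").decode("utf-8")
```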
6 MetaMap
http://mapageweb.umontreal.ca/turner/meta/english/metamap.html
There are many, many standards. This presentation only discusses a select few…
7 MARC 21
- MARC 21 (ISO 2709):
  - ISO 2709-based metadata communications protocol
  - Choice of two character encoding options:
    - MARC 8 (ASCII, ANSEL, selected ISO, EACC)
    - Unicode (limited to equivalents of MARC 8 repertoire)
  - XML expression is now also an option
  - Maintenance agency: Library of Congress w/ NLC, BL
- Strengths:
  - Well-maintained, mature standard
  - Widely adopted by library communities
  - Large universe of MARC 21 records available
  - Wide choice of software vendors
- Weaknesses (in the present & future):
  - Virtually unused outside of libraries
  - Limits on field and record size
  - Restricted range of scripts supported
  - Limited ability to convey complex relationships, hierarchy, attributes at tag/subfield level
Notes: MARC 21 is well known to all of you, but it may be helpful to review some key points and new developments before looking at non-MARC standards.
More info:
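A rough sketch of where the "limits on field and record size" come from, assuming the usual MARC 21 layout of an ISO 2709 record: a 24-character leader, then a directory of 12-character entries (3-character tag, 4-digit field length, 5-digit starting position), then the field data. The function names and the sample field are invented for illustration; this is not a complete MARC reader.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"

# A 4-digit field length and a 5-digit record length cap the sizes that can be expressed.
MAX_FIELD_LENGTH = 9999
MAX_RECORD_LENGTH = 99999


def build_record(fields):
    """Assemble a minimal record from (tag, field bytes) pairs (illustrative helper)."""
    directory = b""
    data = b""
    for tag, value in fields:
        field = value + FIELD_TERMINATOR
        if len(field) > MAX_FIELD_LENGTH:
            raise ValueError("field too long for a 4-digit directory length")
        # Directory entry: tag, field length, offset from the base address of data.
        directory += tag.encode("ascii") + b"%04d%05d" % (len(field), len(data))
        data += field
    directory += FIELD_TERMINATOR
    base_address = 24 + len(directory)
    record_length = base_address + len(data) + len(RECORD_TERMINATOR)
    if record_length > MAX_RECORD_LENGTH:
        raise ValueError("record too long for a 5-digit record length")
    leader = b"%05d" % record_length + b"nam a22" + b"%05d" % base_address + b" a 4500"
    return leader + directory + data + RECORD_TERMINATOR


def parse_record(record):
    """Return (declared record length, [(tag, field bytes)]) by walking the directory."""
    leader = record[:24]
    record_length = int(leader[0:5])
    base_address = int(leader[12:17])
    directory = record[24:base_address - 1]  # exclude the directory's terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])
        start = int(entry[7:12])
        field = record[base_address + start:base_address + start + length]
        fields.append((tag, field[:-1]))  # drop the trailing field terminator
    return record_length, fields


# Indicators "10", then subfield delimiter 0x1F and subfield code "a".
sample = build_record([("245", b"10\x1faMetadata Standards")])
declared_length, parsed_fields = parse_record(sample)
```

Because every length and offset is a fixed-width decimal string, a field longer than 9,999 bytes or a record longer than 99,999 bytes cannot be described by the leader and directory.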
8 MODS
- Metadata Object Description Schema (MODS):
  - Essentially MARC 21 recast in an XML-native framework
    - Text-based tags rather than numeric ones
    - Selected clusters of related MARC 21 attributes condensed into single MODS element
  - MARC 21 readily converts to MODS, but you can’t do a lossless reverse conversion of MODS to MARC 21
  - Maintenance agency: Library of Congress
- Value of MODS:
  - A rich, library-oriented XML metadata schema
  - Optimized for from-MARC conversion of legacy records
  - Well-suited as a metadata format for OAI harvesting
- Applications of MODS:
  - LC planning to convert 100K American Memory records
  - Minerva project, U of Chicago Press, California Digital Library, others using or planning to use for records for web sites, e-texts
  - OpenOffice Bibliographic Project
Notes: MODS is essentially an XML rendering of MARC’s content, but unlike MARC XML, the tagging is text rather than numeric. MODS also simplifies some aspects of MARC.
OpenOffice Bibliographic Project:
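A small sketch of "text-based tags rather than numeric ones", assuming Python's standard `xml.etree.ElementTree`. It maps three MARC 21 field/subfield pairs (100 $a, 245 $a, 260 $b) onto the corresponding MODS elements. The function name, the dictionary input shape, and the sample values are invented for illustration, and a real conversion covers far more fields and attributes.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(name):
    """Qualify a MODS element name with the MODS namespace."""
    return f"{{{MODS_NS}}}{name}"


def marc_to_mods(fields):
    """Convert a tiny, hypothetical {(tag, subfield): value} mapping to a MODS element."""
    mods = ET.Element(q("mods"))

    title = fields.get(("245", "a"))
    if title:
        title_info = ET.SubElement(mods, q("titleInfo"))
        ET.SubElement(title_info, q("title")).text = title

    author = fields.get(("100", "a"))
    if author:
        name = ET.SubElement(mods, q("name"), {"type": "personal"})
        ET.SubElement(name, q("namePart")).text = author

    publisher = fields.get(("260", "b"))
    if publisher:
        origin = ET.SubElement(mods, q("originInfo"))
        ET.SubElement(origin, q("publisher")).text = publisher

    return mods


sample_fields = {
    ("100", "a"): "Childress, Eric",
    ("245", "a"): "Metadata standards",
    ("260", "b"): "OCLC",
}
mods_record = marc_to_mods(sample_fields)
print(ET.tostring(mods_record, encoding="unicode"))
```

The numeric tags disappear from the output: a reader sees `titleInfo`, `name`, and `originInfo` instead of 245, 100, and 260.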
9 MARC 21 & MODS
Features compared across MARC 21, MARC 21 Unicode, MARC XML, MARC Slim, and MODS (values in slide order):
- Structure: ISO 2709; XML
- Encoding: MARC 8; Unicode
- Repertoire of scripts: JACKPHY
- Conversion from MARC 21: lossless; minimal loss
- Conversion to MARC 21: lossless?; minor loss
- Bibliographic: OCLC; OCLC R; OCLC DCPS
- Authority: x
- Classification
- Community
- Holdings
Notes:
- MARC XML could potentially be used as follows:
  - For representing a complete MARC record in XML
  - As an extension schema to METS (Metadata Encoding and Transmission Standard)
  - To represent metadata for OAI harvesting
  - For original resource description in XML syntax
  - For metadata in XML that may be packaged with an electronic resource
- Some MARC XML advantages are:
  - The schema supports all MARC encoded data regardless of format
  - The MARC XML framework is a component-oriented, extensible architecture allowing users to plug and play different software pieces to build custom solutions
- Limitations of MARC XML:
  - Validation with the MARC XML schema is accomplished via a software tool. This software, external to the schema, will provide three possible levels of validation: (1) Basic XML validation according to the MARC XML Schema; (2) Validation of MARC21 tagging (field and subfield); (3) Validation of MARC record content, e.g., coded values, dates, and times.
  - “MARC slim” (xmlns=" – a simple validation approach
- Added note: UNIMARC and XML:
  - Ministère de la culture et de la communication (France), Board of Research and Technology is developing:
    - BiblioML DTD for converting UNIMARC to XML
    - Conversion tools
10 Dublin Core
- Dublin Core Metadata Element Set:
  - ISO 15836:2003(E) The Dublin Core metadata element set
  - DCMES also issued as NISO Z39.85
  - A standard for cross-domain resource description
  - Designed primarily to support discovery and retrieval
  - Defines semantics but not syntax (i.e. container)
  - Choice of simple or qualified DC
  - Maintenance agency: Dublin Core Metadata Initiative (DCMI) hosted by OCLC Research
- Value of Dublin Core:
  - Simplicity, extensibility, interoperability
  - Worldwide adoption (DCMES translated into 20+ languages)
  - Usable as crosswalk between major metadata standards
- Applications of Dublin Core:
  - Open Archives Initiative (OAI) mandates DC metadata
  - Wide variety of extended versions in use:
    - In digital library, archives, museums projects
    - By e-government programs (AU, CA, DK, FI, IE, NZ, UK)
  - OCLC usage: Connexion, DCPS, ContentDM, Research
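A brief sketch of "defines semantics but not syntax", assuming Python's standard library. The same simple Dublin Core element/value pairs are written into two different containers: HTML `meta` tags using the `DC.` name prefix, and XML elements in the Dublin Core element namespace. The wrapper element `metadata`, the function names, and the sample values are invented for illustration.

```python
import html
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Simple (unqualified) Dublin Core: a list of (element, value) pairs.
record = [
    ("title", "Metadata Standards"),
    ("creator", "Childress, Eric"),
    ("date", "2003-11-18"),
]


def to_html_meta(pairs):
    """Container 1: HTML meta tags, one per element/value pair."""
    return "\n".join(
        f'<meta name="DC.{element}" content="{html.escape(value, quote=True)}">'
        for element, value in pairs
    )


def to_xml(pairs):
    """Container 2: XML elements in the Dublin Core namespace."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, value in pairs:
        ET.SubElement(root, f"{{{DC_NS}}}{element}").text = value
    return ET.tostring(root, encoding="unicode")


print(to_html_meta(record))
print(to_xml(record))
```

The element names and their meanings stay fixed while the container changes, which is what lets Dublin Core serve as a crosswalk between formats.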
11 ONIX
- ONIX International (Online Information Exchange):
  - Standard data exchange format for publishers & jobbers
  - Based on EPICS (EDItEUR Product Information Communication Standards)
  - For representing and communicating book industry product information in electronic form
  - Offers two levels of richness (level 1 & level 2)
  - XML schema with Unicode encoding
  - Maintenance agency: EDItEUR working with input from the Book Industry Communication (BIC) and the Book Industry Study Group (BISG)
- Value of ONIX:
  - Meets needs of publishers, jobbers, retail sellers for:
    - Easier access to richer book data (including bibliographic data, cover art, blurbs, TOCs, UPC data, and much more)
    - An inexpensive-to-implement common data exchange format
- Applications of ONIX:
  - Primarily oriented towards publishers, jobbers, retailers
  - Most major players (Amazon, Baker & Taylor, etc.) now using/supporting ONIX
  - Some interest by libraries & ILS vendors in ONIX
Notes: ONIX is a new standard built by the publishing community. It’s designed to convey a variety of descriptive metadata, but also business-related metadata and important auxiliary objects like image files of book cover art. The goal of ONIX is to standardize the transmitting of product information so that wholesalers, retailers and others in the supply chain will all be able to accept information that is transferred electronically in ONIX International format.
12 TEI
- Text Encoding Initiative (TEI):
  - For complex markup of literary texts
  - Both SGML & XML DTDs available
  - TEI “header” (TEIH) can be used as a metadata record
  - Maintenance agency: TEI Consortium:
    - TEI Consortium has executive offices in Bergen, Norway, and is hosted at four university sites worldwide: the Univ. of Bergen, Brown Univ., Oxford Univ., and the Univ. of Virginia
    - Maintains “P4” Guidelines for Electronic Text Encoding and Interchange
- Value of TEI:
  - Designed to meet the needs of scholarly research community (esp. in the humanities) for a variety of activities including:
    - Adding in-line academic commentary in e-texts
    - As an aid to research by supporting special indexing points, etc.
- Applications of TEI:
  - Widely used by major humanities electronic text collections such as CETH, UVa e-text center, many others
Notes: TEI is an e-text markup schema tailor-made for humanities scholars.
13 EAD
- Encoded Archival Description (EAD):
  - A format for expressing electronic archival finding aids
  - EAD DTD (Version 2002) is designed to function as both an SGML and XML DTD
  - Maintained jointly by the Library of Congress and the Society of American Archivists (SAA)
- Value of EAD:
  - Effectively an organized presentation of a collection of documents (typically in an archive or manuscript collection)
  - EAD header carries metadata for the finding aid
  - Provides for simple or complex mark-up to support varying levels of indexing
  - Well-suited for interweaving narrative with links to specific objects in a collection (either directly to the object or via a record for the object that may link to the object)
- Applications of EAD:
  - Conversion of existing paper finding aids to electronic form
  - Widely used by academic institutions and archives in North America
  - RLG Archival Resources database hosts copies of many EADs
Notes: EAD is a standard jointly owned by LC and SAA that provides a format for expressing a variety of collection-level information in the form of electronic finding aids.
14 METS
- Metadata Encoding and Transmission Standard (METS):
  - A standard “shell” for encoding data essential for retrieving, preserving, and serving up digital resources
  - Six modules define descriptive, administrative, structural, rights and other metadata
  - Some parts of a METS object may be external (e.g., a MODS record for the descriptive metadata)
  - Maintenance agency: Library of Congress
- Value of METS:
  - Need for METS identified at DLF metadata experts meetings
  - Varied local approaches to non-descriptive metadata not scaling well & offering little interoperability between agencies
  - Offers a standard mode for object “packaging” for preservation, institutional repositories, other activities
- Applications of METS:
  - LC: planning to use with selected moving images, audio recordings, folk life mixed media collections
  - OCLC DCPS, RLG, Harvard, Stanford, UC Berkeley, National Library of Wales exploring or using for variety of projects
Notes: METS is a new standard designed to capture and convey non-descriptive metadata as well as descriptive metadata. It has been developed in response to LC’s own burgeoning digital collection needs, with the aid of insight the Digital Library Federation has gathered from its members.
Useful presentation (2003):
15 MIX
- Metadata for Images in XML (MIX):
  - XML schema for a set of technical data elements required to manage digital image collections
  - Format for interchange and/or storage of the data specified in the NISO Draft Standard Data Dictionary: Technical Metadata for Digital Still Images (version 1.2)
  - Still in early development and testing phases
  - Collaboration of: Library of Congress and NISO Technical Metadata for Digital Still Images Standards Committee
- Value of MIX:
  - Provides a common XML schema for expressing technical data particular to still and moving digital images
  - Can be used with other schema such as METS and MODS as part of a comprehensive approach to managing and preserving digital images
- Applications of MIX:
  - OCLC DCPS, LC, others planning or testing
  - MIX still in nascent stage of development and testing
Notes: MIX is a very new metadata standard. It has been developed by LC with the help of NISO, and is based in large part on NISO’s work on developing technical metadata standards for images.
16 Summary
Features compared across DC, ONIX, TEI, EAD, METS, and MIX (values in slide order):
- Structure: XML
- Encoding: Unicode
- Repertoire of scripts
- Conversion from MARC 21: Lossiness varies; Minimal loss; Header only - lossy; Header only - lossy
- Conversion to MARC 21: Some ONIX-only data may be lost; Header only - lossless; Header only - lossless
- Chief purpose:
  - DC: Simple description for discovery & retrieval
  - ONIX: Publisher product info exchange
  - TEI: Markup of scholarly Etexts
  - EAD: Markup of electronic finding aids
  - METS: Shell with technical data
  - MIX: Technical data for digital images
- Primary user base: e-Govt, Libraries, Museums, Archives; Publishers, Jobbers; Humanities scholars; Archives, Libraries
- Maintenance agency: DCMI; Editeur; TEI Consortium; LC w/ SAA; LC
17 RDF
- Resource Description Framework (RDF):
  - Graph theory (i.e. arcs and nodes)-influenced, XML syntax-based metalanguage for expressing metadata about web resources
  - Designed to convey metadata for machine consumption (raw RDF is not very human-readable)
  - Fundamental building block of RDF is the triple (subject + predicate + object)
  - Maintained by the W3C; RDF specification under revision
- Value of RDF:
  - A subject of debate (typically RDF vs. XML)!
  - Pro: Model-based expression of metadata critical to the Semantic Web (i.e. derived connections); more flexible, scalable and forgiving standard than XML
  - Con: RDF carries unneeded processing overhead vs. XML; RDF specification has too many flaws; few use RDF
- Applications of RDF:
  - Open Directory Project, selected software (e.g., Siderean)
  - OCLC Connexion exports Dublin Core in RDF/XML
Specification and useful resources:
Open Directory:
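A minimal sketch of the triple model in plain Python, with no RDF library assumed. Each statement is a (subject, predicate, object) tuple; the `example.org` URIs, the `name` predicate, and the function names are invented for illustration. Following the object of one triple to the subject of another is the "arcs and nodes" idea in miniature.

```python
DC = "http://purl.org/dc/elements/1.1/"
EX_NAME = "http://example.org/terms/name"  # hypothetical predicate

# Each tuple is one triple: (subject, predicate, object).
TRIPLES = [
    ("http://example.org/slides", DC + "title", "Metadata Standards"),
    ("http://example.org/slides", DC + "creator", "http://example.org/people/childress"),
    ("http://example.org/people/childress", EX_NAME, "Eric Childress"),
]


def objects(triples, subject, predicate):
    """Return every object attached to the given subject by the given predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def creator_names(triples, resource):
    """Follow two arcs: resource -> creator node -> that node's name."""
    names = []
    for creator in objects(triples, resource, DC + "creator"):
        names.extend(objects(triples, creator, EX_NAME))
    return names


print(objects(TRIPLES, "http://example.org/slides", DC + "title"))
print(creator_names(TRIPLES, "http://example.org/slides"))
```

The second query yields a connection that no single triple states directly, which is the kind of derived link the Semantic Web argument relies on.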
18 CSDGM (a.k.a. FGDC)
- Content Standard for Digital Geospatial Metadata (CSDGM) [better known as “FGDC”]:
  - CSDGM Version 2 - FGDC-STD
  - Defines a common set of terminology and definitions for the documentation of digital geospatial data
  - Maintained by Federal Geographic Data Committee (FGDC) [an interagency committee]
  - Crosswalk of FGDC to ISO 19115:2003(E) Geographic information - Metadata available; ANSI technical amendment for ISO-FGDC harmonization in progress
- Value of FGDC:
  - Provides common standard for publishing metadata about geospatial resources
  - Widely used by government and business
  - Many systems and applications support the standard
- Applications of FGDC:
  - Adopted or usable by major geospatial agencies in the West
  - Usefulness extended with profiles (e.g. Biological Data)
19 COSATI
- Committee on Scientific and Technical Information (COSATI):
  - Cataloging rules and record format for the descriptive cataloging of technical reports and similar documents
  - Field tags are alpha strings (not numerical like MARC)
  - Related COSATI subject category list can be used
  - Owned by CENDI (the Commerce, Energy, NASA, Defense Information Managers Group) [successor to COSATI]
- Value of COSATI:
  - Supports straightforward capture of useful metadata for scientific and technical information
- Applications of COSATI:
  - Used by a number of science/technical and defense U.S. federal agencies
  - Small number of library systems (e.g., SIRSI) support COSATI record import/export
  - COSATI can be converted to MARC if desired
Notes: COSATI subject category list is based on the US Engineers Joint Council Thesaurus of Engineering and Scientific Terms and is itself now the basis for the SIGLE Subject Category List.