Batch-conversion of Non-standard Multiscript Records by XSLT Lucas Mak Metadata and Catalog Librarian Michigan State University Catalog Management Interest.

Presentation transcript:

Batch-conversion of Non-standard Multiscript Records by XSLT Lucas Mak Metadata and Catalog Librarian Michigan State University Catalog Management Interest Group, ALA Midwinter 2011, Jan. 8, 2011, San Diego CA

Agenda
Background
– Structure of multiscript records
    Model A vs. Model B
– Using z39.50 for cataloging
    Multiscript records retrieved through z39.50
– Coding issues
– Problems caused by non-standard multiscript records
Solutions
– Design of XSLT
    Processing logic
    Factors affecting the design
    Limitations & unintended consequences

Structure of Multiscript Records
Multiscript records
– For recording data in multiple scripts in MARC records
– One script may be considered the primary script of the data content of the record, even though other scripts are also used for data content
– Two models
    Model A: Vernacular & Transliteration
    Model B: Simple Multiscript Records

Structure of Multiscript Records
Model A: Vernacular & Transliteration
– The regular fields may contain data in different scripts and in the vernacular or transliteration of the data. Fields 880 are used when data needs to be duplicated to express it in both the original vernacular script and transliterated into one or more scripts
– Model A data in the regular fields is linked to the data in 880 fields by a subfield $6 that occurs in both of the associated fields
    $6 [linking tag]-[occurrence number]/[script identification code]/[field orientation code]
* MARC21 Bibliographic Appx. D
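
The subfield $6 layout above can be illustrated with a short parser. This is a minimal sketch in Python, not code from the presentation; the names `Linkage` and `parse_subfield_6` are hypothetical, and the pattern assumes a three-digit linking tag and a two-digit occurrence number.

```python
import re
from typing import NamedTuple, Optional


class Linkage(NamedTuple):
    """Parts of a MARC subfield $6 value (illustrative container)."""
    linking_tag: str
    occurrence: str
    script_code: Optional[str]
    orientation: Optional[str]


# [linking tag]-[occurrence number]/[script identification code]/[field orientation code]
# The last two parts are optional; the script code may be empty (e.g. "245-01//r").
_SUBFIELD_6 = re.compile(r"^(\d{3})-(\d{2})(?:/([^/]*))?(?:/([^/]*))?$")


def parse_subfield_6(value: str) -> Linkage:
    """Split a subfield $6 value into its four components."""
    match = _SUBFIELD_6.match(value)
    if match is None:
        raise ValueError(f"not a well-formed subfield $6 value: {value!r}")
    tag, occurrence, script, orientation = match.groups()
    # Empty or missing optional parts are reported as None.
    return Linkage(tag, occurrence, script or None, orientation or None)
```

For example, `parse_subfield_6("245-01/$1")` yields linking tag `245`, occurrence number `01`, and script identification code `$1`, with no field orientation code.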

Structure of Multiscript Records Model A: Vernacular & Transliteration Linking Tag Occurrence Number Script Identification Code Field Orientation Code

Structure of Multiscript Records Model A: Vernacular & Transliteration Linking Tag Occurrence Number

CJK Record according to Model A Specifications

Structure of Multiscript Records
Model B: Simple Multiscript Records
– All data is contained in regular fields and script varies depending on the requirements of the data
– Repeatability specifications of all fields should be followed
– Although the Model B record may contain transliterated data, Model A is preferred if the same data is recorded in both the original vernacular script and transliteration
– Field 880 is not used
* MARC21 Bibliographic Appx. D

CJK Record according to Model B Specifications Item in Chinese. Cataloging language in English

Structure of Multiscript Records
Field 066 (Character Sets Present)
– To indicate the MARC-8 character sets other than the default sets that are invoked in the record
MARC-8 vs. Unicode Environment
                              MARC-8      Unicode
MARC Field 066                Required    N/A
Script Identification Code    Required    N/A
Field Orientation Code        Required

z39.50 for Cataloging
SkyRiver
– MSU switched to SkyRiver in Oct 2009
– Ways to expand the pool of re-usable bibliographic records
    z39.50 function in Innovative Millennium (day-to-day cataloging)
    MarcEdit z39.50 client (HathiTrust record load)

z39.50 search in Millennium

z39.50 search in Millennium (Record retrieved for Editing)

HathiTrust Data Availability

MarcEdit z39.50 Client (HathiTrust) Batch search against Univ. of Michigan Catalog using UM record identifier

HathiTrust Record Load Workflow
Diagram labels: U of M Catalog, MSU Catalog, Record Dump, Request, Retrieve

Non-standard Multiscript Records from z39.50 Sample Non-standard CJK Record Retrieved by MSU Millennium z39.50 Client

Same Record in Source Library Catalog (Staff View)

Non-standard Multiscript Records from z39.50
HathiTrust Record Retrieved by MarcEdit z39.50 Client*
* As of Dec. 10, 2010, Univ. of Michigan has rebuilt 880 fields on their z39.50 serving records

Same HathiTrust Record in Univ. of Michigan Catalog (Staff View)

Coding Issues
Non-standard Coding
Field-pairing
– Vernacular data in regular field
No linking tag in subfield $6
No script identification code in subfield $6 (may be due to Unicode environment)
Standard Model A Coding
Field-pairing
– Transliteration in regular field
– Vernacular data in 880 field
Linking tag
– Tag number of an associated field
Script identification code*
– $1 => CJK script
* Applicable to MARC-8 encoded records

Coding Issues
Non-standard Coding
No field orientation code in subfield $6
Standard Model A Coding
Field orientation code
– /r

Coding Issues
Non-standard Coding Practice
Repeat non-repeatable fields (245, 250)
Duplication of data in both vernacular and transliteration
Model B Guidelines
Repeatability specifications of all fields should be followed
Model A is preferred if the same data is recorded in both the original vernacular script and transliteration

Problems Caused by Non-standard Multiscript Records
Irregular/Incorrect field orientation in Arabic and Hebrew records in OPAC display
– Left-to-right display of subfields in “Title” due to the lack of “Field Orientation code” while scripts within subfields are from right to left
“Field Orientation code” added back

Problems Caused by Non-standard Multiscript Records
Irregularity in result display
– Inconsistent sequencing of vernacular and transliteration fields

Problems Caused by Non-standard Multiscript Records
Database maintenance
– Data structure inconsistency
    Same kind of data resides in two different places
    Extra steps needed to accommodate inconsistencies
– Heading validation issues
    NACO records with headings in vernacular in 4xx since mid 2008
    Vernacular headings (4xx) in regular fields

Problems Caused by Non-standard Multiscript Records
Expectation in retrieval of vernacular data
– MSU only indexes CJK and Cyrillic data in 880 fields
– Arabic, Hebrew, Greek, and other vernacular data in regular fields of non-standard multiscript records are indexed and searchable
    Create a false impression that patrons can search in scripts other than CJK and Cyrillic

Solutions
MSU uses Model A for multiscript records
Tasks
– To change field tag of vernacular data to 880
– Subfield $6 in both regular & 880 fields
    To insert linking tag
– Subfield $6 in 880 fields
    To insert script identification code*
    To insert field orientation code for Arabic & Hebrew records
– To insert 066 field if it does not already exist*
*No longer applicable since MSU has moved to Unicode environment

Solutions
Necessary steps
– Determine which fields contain vernacular data
    Replace regular field tag with 880
– Determine which script(s) are contained in a record
    Insert field 066*
    Insert “Script Identification code*” and “Field Orientation code” when appropriate
*No longer applicable since MSU has moved to Unicode environment

Solutions
XSLT (Extensible Stylesheet Language Transformations)
– Within the family of XML
    Current version: 2.0
    Case sensitive
– “Transformation” means:
    Manipulation of XML documents by creating a new document based on the original document
– Common usages in library context
    Web display – e.g. converting EAD into HTML for display
    Metadata crosswalking
– Data selection and manipulation
– Conditional processing
    Specify matching criteria and corresponding action(s)

Database Maintenance Workflow
MSU Catalog → Uncorrected MARC File → Format Conversion → Uncorrected MARCXML → XSLT Processor → Corrected MARCXML → Format Conversion → Corrected MARC File

Alternative HathiTrust Pre-load Data Cleanup Workflow
U of M Catalog (Request / Retrieve) → Uncorrected records → XSLT Processor → Corrected records → MSU Catalog

Design of XSLT
Processing logic
– Regular field to 880 and insert linking tag
    Remove all roman data from a field
    Determine length of a field
    – 0 => no vernacular data
    – ≠0 => contains vernacular data
– Field 066, Script identification & Field orientation codes
    Match vernacular data field against vernacular characters

Design of XSLT
Remove all roman data
– Roman data (ASCII, special characters & diacritics used in transliteration)
– replace() and translate() functions
    Find “pattern A” and replace it with “pattern B”
    – Replace roman data with nothing

Design of XSLT
Test the length of the field after removing all non-vernacular data
– XSLT elements: xsl:choose in combination with xsl:when & xsl:otherwise
– XSLT functions: string-length()
…… [series of actions when string-length equals 0]
…… [series of actions when string-length does not equal 0]
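
The "remove roman data, then test the length" step can be sketched outside XSLT as well. The following Python illustration is not the stylesheet from the presentation; the function names and the exact Unicode ranges treated as "roman data" (Latin letters, combining diacritics, modifier letters, general punctuation) are assumptions chosen for the example.

```python
import re

# Assumed definition of "roman data": Basic Latin through Latin Extended-B,
# modifier letters and combining diacritics, Latin Extended Additional,
# and general punctuation. A production list may need to be broader.
ROMAN_DATA = re.compile(r"[\u0000-\u024F\u02B0-\u036F\u1E00-\u1EFF\u2000-\u206F]")


def strip_roman(text: str) -> str:
    """Replace every roman character with nothing."""
    return ROMAN_DATA.sub("", text)


def has_vernacular(text: str) -> bool:
    """A non-zero length after stripping means vernacular data remains."""
    return len(strip_roman(text)) != 0
```

A transliterated title such as `"Hong lou meng /"` strips down to an empty string, while `"紅樓夢 /"` leaves three characters, so only the second would be treated as a vernacular field.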

Design of XSLT
Field with no vernacular data
Test length of the field
Insert original values
Insert linking tag (880) and original occurrence number
Copy subfields other than $6

Design of XSLT
Field with vernacular data
…… [Insert “Script Identification Code” & “Field Orientation Code”]
Insert “880” as tag no.
Insert original values
Insert original tag no. as linking tag
Insert original occurrence number
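
The two branches described on these slides can be sketched as a single function. This is a hedged Python illustration rather than the presenter's XSLT: `Field`, `occurrence_number`, and `relink` are hypothetical names, the incoming non-standard $6 is assumed to carry at least a two-digit occurrence number, and using occurrence number 00 for vernacular fields with no $6 is an assumption about how unlinked data might be handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Field:
    """Simplified MARC data field (illustrative, not a real library class)."""
    tag: str
    indicators: str
    subfields: List[Tuple[str, str]]  # (subfield code, value) pairs


def occurrence_number(field: Field) -> Optional[str]:
    """Return the occurrence number from $6, tolerating a missing linking tag."""
    for code, value in field.subfields:
        if code == "6":
            # "01", "245-01" and "880-01/$1" all give "01".
            return value.split("/")[0].split("-")[-1]
    return None


def relink(field: Field, is_vernacular: bool) -> Field:
    """Rebuild $6 and, for vernacular data, retag the field as 880."""
    occurrence = occurrence_number(field)
    others = [(code, value) for code, value in field.subfields if code != "6"]
    if is_vernacular:
        # Original tag becomes the linking tag; the field itself becomes 880.
        linkage = f"{field.tag}-{occurrence or '00'}"
        return Field("880", field.indicators, [("6", linkage)] + others)
    if occurrence is None:
        return field  # ordinary field with nothing to link
    # Transliterated field keeps its tag and points at the 880 field.
    return Field(field.tag, field.indicators, [("6", f"880-{occurrence}")] + others)
```

Script identification and field orientation codes are left out here so the tag and linking logic stays visible on its own.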

Design of XSLT
Insert “Script Identification Code” (MARC-8 environment)
/(3    Insert code for Arabic
/(S    Insert code for Greek
/(2    Insert code for Hebrew
/(N    Insert code for Cyrillic
/$1    Insert code for CJK

Design of XSLT
Insert “Field Orientation Code”
//r
Test if the subfield contains Arabic or Hebrew script
Insert Field Orientation Code
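
As an illustration of how the script identification and field orientation codes might be chosen, here is a minimal Python sketch. It is not the presenter's stylesheet: the Unicode block ranges, the function names, and the choice to classify a field by its first recognized character are all assumptions made for the example. The codes themselves are the ones listed on the slides.

```python
from typing import Optional

# (first code point, last code point, script identification code)
SCRIPT_RANGES = [
    (0x0600, 0x06FF, "(3"),  # Arabic
    (0x0370, 0x03FF, "(S"),  # Greek
    (0x0590, 0x05FF, "(2"),  # Hebrew
    (0x0400, 0x04FF, "(N"),  # Cyrillic
    (0x3040, 0x30FF, "$1"),  # Hiragana and Katakana
    (0x4E00, 0x9FFF, "$1"),  # CJK unified ideographs
    (0xAC00, 0xD7A3, "$1"),  # Hangul syllables
]
RIGHT_TO_LEFT = {"(3", "(2"}  # Arabic and Hebrew


def script_code(text: str) -> Optional[str]:
    """Return the code for the first character that falls in a known range."""
    for character in text:
        code_point = ord(character)
        for first, last, code in SCRIPT_RANGES:
            if first <= code_point <= last:
                return code
    return None


def linkage_suffix(text: str, marc8: bool) -> str:
    """Build the part of $6 that follows the occurrence number."""
    code = script_code(text)
    if code is None:
        return ""
    # The script identification code is only written in a MARC-8 environment.
    suffix = "/" + code if marc8 else ""
    if code in RIGHT_TO_LEFT:
        # With no script code, the empty position still needs its slash: "//r".
        suffix += "/r" if marc8 else "//r"
    return suffix
```

Under these assumptions an Arabic field gets `/(3/r` in a MARC-8 environment and `//r` in a Unicode environment, while a CJK field gets `/$1` or nothing at all.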

Design of XSLT
Field 066 (MARC-8 environment)
– Insert character set code in subfield $c
– A single record may have more than one vernacular script => multiple subfield $c
    XSLT element: xsl:if
    – Allows multiple matches
    XSLT function: matches()
– Processing logic
    Turn the whole record into a text string
    Remove all Latin data
    Match vernacular script against normalized text string

Design of XSLT
After removing all Latin data from the record
… c (3
…… c $1
Replace Arabic characters with “3”
Test if the normalized data contains “3”
Insert “(3” as the character set code in $c
Test if any non-alphanumeric characters exist
Insert code for CJK
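
The record-level test for field 066 can be approximated as follows. This Python sketch simplifies the slide's approach: instead of normalizing each script to a marker character and treating leftover characters as CJK, it matches each script's Unicode block directly. The ranges and the function name `character_set_codes` are assumptions for illustration, and only the five scripts named in the presentation are covered.

```python
import re
from typing import List

# Each script is tested independently, so one record can yield several codes.
SCRIPT_PATTERNS = [
    ("(3", re.compile(r"[\u0600-\u06FF]")),                          # Arabic
    ("(S", re.compile(r"[\u0370-\u03FF]")),                          # Greek
    ("(2", re.compile(r"[\u0590-\u05FF]")),                          # Hebrew
    ("(N", re.compile(r"[\u0400-\u04FF]")),                          # Cyrillic
    ("$1", re.compile(r"[\u3040-\u30FF\u4E00-\u9FFF\uAC00-\uD7A3]")),  # CJK
]


def character_set_codes(record_text: str) -> List[str]:
    """Return one 066 $c value for every script found in the record text."""
    return [code for code, pattern in SCRIPT_PATTERNS if pattern.search(record_text)]
```

A record whose text contains both Cyrillic and CJK characters would produce two values, `(N` and `$1`, which corresponds to the "multiple subfield $c" case; a record with only Latin data produces an empty list, meaning no 066 field is needed.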

Design of XSLT
Factors affecting the design
– Pre-load vs. post-load data clean up (HathiTrust workflow)
    Mechanism to filter out non-multiscript records needed for pre-load data clean up
    Construction of 949 overlay command*
– MARC-8 vs. Unicode
    Field 066 and Script identification code not allowed in Unicode environment
    – 2 separate XSLTs made
– OCLC vs. MARC21 Standard
    Representation of Bengali, Devanagari, Tamil, and Thai in field 066
* Innovative Millennium specific

Limitations & Unintended Consequences
Processing of data represented by UTF-8 character number
– \U+0e33\\U+0e43\\U+0e2b\\U+0e49\
Vernacular scripts processed (MARC-8 environment)
Handling of unlinked vernacular data
– Implications on OPAC display

Questions? Lucas Mak Michigan State University Libraries