Automatic Conversion from MARC to FRBR
Christian Mönch (MTA SZTAKI), Trond Aalberg (NTNU)
Distributed Systems Division (DSD), MTA SZTAKI

ECDL, Trondheim, Norway

Outline
- Bibliographic catalogs
- The FRBR model
- A framework for extracting FRBR entities from MARC-based catalogs
- Application to the BIBSYS catalog
- Results

Record-Based Bibliographic Catalogs
- Structure: a set of records (used for search and exchange)
- Record: a surrogate for a publication
  - A set of attributes (name-value pairs)
- Problems:
  - Non-normalized structure with excessive data replication
  - Many search requests are unsupported or require knowledge of the bibliographic format

IFLA's FRBR Model
- ER model with three groups of entities
- Four operations on entities: search, identify, select, obtain
- [Diagram: a Work is realized through an Expression, which is embodied in a Manifestation, which is exemplified by an Item. Relationships shown: Translation, Adaptation, Whole/Part. Also shown: Person, Corporate Body.]
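
The entity chain on this slide can be pictured as nested data structures. The following is a minimal sketch, not part of the FRBR specification: the class fields (`language`, `identifier`, and so on) and the sample values are assumptions chosen for illustration, and the strict tree shape is a simplification, since a manifestation may embody more than one expression.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    identifier: str  # hypothetical copy identifier, e.g. a shelf mark


@dataclass
class Manifestation:
    title: str
    items: List[Item] = field(default_factory=list)  # "is exemplified by"


@dataclass
class Expression:
    language: str  # hypothetical attribute used to tell expressions apart
    manifestations: List[Manifestation] = field(default_factory=list)  # "is embodied in"


@dataclass
class Work:
    title: str
    creators: List[str] = field(default_factory=list)  # Person / Corporate Body names
    expressions: List[Expression] = field(default_factory=list)  # "is realized through"


# Illustrative instance: one work with an original and a translated expression.
work = Work("Et dukkehjem", ["Ibsen, Henrik"])
original = Expression("nor")
translation = Expression("eng")
work.expressions += [original, translation]
translation.manifestations.append(Manifestation("A doll's house", [Item("copy-1")]))
```

Navigating from a work down to a physical copy is then a matter of following the three relationships in order.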

Availability?
- A highly structured model that supports
  - A multitude of search operations
  - Navigation of bibliographic records
- Expensive to create
  - Re-cataloging is unaffordable
  - Hence: automatic conversion

Automatic Creation of FRBR Instances
[Diagram: Records, Splitting, Work Clustering (Work 1, Work 2), Expression Clustering (Expression 1, Expression 2); entities: Item, Manifestation, Expression, Work; relationship: is realized through]
- Extract manifestations and items from records
- Identify and split aggregative records
- Cluster the record set to identify works
- Cluster the work sets to identify expressions
- Create entities from the clusters
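
The two clustering steps can be sketched as grouping by a key. This is only an illustration under stated assumptions: records are plain dictionaries with hypothetical fields (`creator`, `title`, `original_title`, `language`), works are keyed by creator plus original title, and expressions are keyed by language. The extraction and splitting steps are left out, and the slides do not specify the actual clustering keys.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(text.lower().split())


def cluster(records, key):
    """Group records by the value that key(record) returns."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


def work_key(record):
    # Assumption: a work is identified by its creator and its original title,
    # falling back to the title when no original title is recorded.
    title = record.get("original_title") or record["title"]
    return (normalize(record["creator"]), normalize(title))


def expression_key(record):
    # Assumption: expressions of one work are told apart by language.
    return record.get("language", "und")


def build_frbr(records):
    """Cluster records into works, then each work cluster into expressions."""
    works = cluster(records, work_key)
    return {key: cluster(group, expression_key) for key, group in works.items()}


# Hypothetical sample records.
records = [
    {"id": "r1", "creator": "Ibsen, Henrik", "title": "Et dukkehjem", "language": "nor"},
    {"id": "r2", "creator": "Ibsen, Henrik", "title": "A doll's house",
     "original_title": "Et dukkehjem", "language": "eng"},
    {"id": "r3", "creator": "Ibsen, Henrik", "title": "Peer Gynt", "language": "nor"},
]
tree = build_frbr(records)
```

In this sample, `r1` and `r2` fall into the same work cluster and into two different expression clusters, while `r3` forms a work of its own.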

Obstacles to the Automatic Creation of FRBR Model Instances
- Inconsistency of data in catalogs:
  - Identical information is represented differently in different records (attributes, syntaxes)
  - Erroneous data
  - Might be resolved automatically, for example through authority files
- Incompleteness of data in catalogs:
  - Information necessary for clustering has not been captured in the records
  - Requires additional information linked to individual records
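
As an illustration of how an authority file might resolve inconsistent representations, the sketch below maps variant name forms to one authorized form. The table `AUTHORITY` and the function `authorized_form` are hypothetical. A real authority file would be far larger and would be loaded from the catalog system.

```python
# Hypothetical, hand-made authority table: variant form -> authorized form.
AUTHORITY = {
    "ibsen, henrik": "Ibsen, Henrik",
    "ibsen, h.": "Ibsen, Henrik",
    "henrik ibsen": "Ibsen, Henrik",
}


def authorized_form(name, authority=AUTHORITY):
    """Return the authorized form of a name, or the name unchanged if unknown."""
    key = " ".join(name.lower().split())  # ignore case and extra whitespace
    return authority.get(key, name)
```

Values that are not covered by the table pass through unchanged, so the remaining inconsistencies stay visible for later steps.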

The Attribute Layer
[Diagram: the clustering pipeline (Work Clustering, Expression Clustering; Work 1, Work 2, Expression 1, Expression 2) with an Attribute Layer added]
- Extract consistent and error-free FRBR-related Generic Attributes and Properties from the records, e.g. title, creator, isTranslation
- Specific to bibliographic formats and catalogs

The Attribute Layer for BIBSYS (I)
- Classify records: series, monographs
- Monographs may have each of the following characteristics:
  - Linked
  - Aggregative
- Example for retrieval of Generic Attributes from monograph records:
  - Attribute title:
    - Searched in: 130$a, 740$a, 240$a (if 240$l does not exist), and 245$a
    - Extended to referenced records

The Attribute Layer for BIBSYS (II)
- Attribute original title:
  - Searched in: 241$a, 240$a (if 240$l does exist), and 500$a (if it starts with one of the indicators "originaltittler:" or "orig.titt.:")
  - Extended to referenced records
- Attribute creator:
  - Searched in: 100$a and 110$a
  - Extended to referenced records
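
The lookup rules for title, original title, and creator can be sketched as follows. The record representation (a dictionary from field tag to a list of subfield dictionaries) and all function names are assumptions made for this example. Collecting every candidate in the listed order, matching the 500$a indicators case-insensitively, and stripping the indicator from the note are likewise assumptions. The extension to referenced records is not covered.

```python
ORIGINAL_TITLE_PREFIXES = ("originaltittler:", "orig.titt.:")


def subfields(record, tag, code, condition=lambda f: True):
    """Collect subfield `code` from every `tag` field that satisfies `condition`."""
    return [f[code] for f in record.get(tag, []) if code in f and condition(f)]


def strip_prefix(note):
    """Return the text after an original-title indicator, or None if absent."""
    lowered = note.lower()
    for prefix in ORIGINAL_TITLE_PREFIXES:
        if lowered.startswith(prefix):
            return note[len(prefix):].strip()
    return None


def titles(record):
    # 130$a, 740$a, 240$a (only if 240$l does not exist), 245$a
    return (subfields(record, "130", "a")
            + subfields(record, "740", "a")
            + subfields(record, "240", "a", lambda f: "l" not in f)
            + subfields(record, "245", "a"))


def original_titles(record):
    # 241$a, 240$a (only if 240$l exists), 500$a (only with an indicator prefix)
    found = subfields(record, "241", "a")
    found += subfields(record, "240", "a", lambda f: "l" in f)
    for note in subfields(record, "500", "a"):
        title = strip_prefix(note)
        if title is not None:
            found.append(title)
    return found


def creators(record):
    # 100$a and 110$a
    return subfields(record, "100", "a") + subfields(record, "110", "a")


# Hypothetical record of a translated monograph.
record = {
    "100": [{"a": "Ibsen, Henrik"}],
    "240": [{"a": "Et dukkehjem", "l": "engelsk"}],
    "245": [{"a": "A doll's house"}],
    "500": [{"a": "Orig.titt.: Et dukkehjem"}],
}
```

For this sample record, the 240 field carries a `$l` subfield, so its `$a` value counts as an original title and not as a title.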

Application of the Framework to BIBSYS
- Tested on 4379 records related to Henrik Ibsen
  - Works: 41, of which eight were false positives due to different spellings or spelling errors
  - Expressions: 1111
  - Manifestations: 1072, of which 35 contained more than one expression
- But: 3307 records were ignored because reliable retrieval of Generic Attributes was impossible
  - Unreliable: 580 works, 3706 expressions, 3567 manifestations
  - Not convincing!

Ongoing Work
- Fault-tolerant dissimilarity measure for the clustering process
- Use of authority files to disambiguate values
- Leverage information retrieved from high-quality records for incomplete records, thus making incompleteness a property of the whole catalog rather than of single records
- Apply to a record subset of BIBSYS
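
One possible shape of a fault-tolerant dissimilarity measure is sketched below with the standard-library `difflib.SequenceMatcher`. The slides do not say which measure is planned, so the choice of measure, the function names, and the threshold of 0.2 are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def dissimilarity(a, b):
    """Return a value in [0, 1]: 0 for equal strings, larger for less similar ones."""
    a = " ".join(a.lower().split())
    b = " ".join(b.lower().split())
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def same_cluster(a, b, threshold=0.2):
    """Treat two titles as the same if they differ only slightly (assumed threshold)."""
    return dissimilarity(a, b) <= threshold
```

With such a measure, a title containing a single misspelled letter could still be clustered with the correctly spelled title, which addresses the false positives caused by spelling differences.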

Questions?