From Data to Discovery Building Automated Cataloguing Tools with Perl Huw Jones Cambridge University Library.


1 From Data to Discovery: Building Automated Cataloguing Tools with Perl
Huw Jones, Cambridge University Library

3 Cambridge
Small city, big University = lots of libraries!

8 Lots of libraries = lots of books

9 Bibliographic records
- University Library: 3.85 M
- Other libraries: 2.5 M
- 8 databases

10 Data problems
- Quality
- Duplication

11 Quality – fullness
Of 2.5 M records in our databases, 1 M are short records

12 Quality – coding

13 Duplication

14 Effects
- Difficulty in resource discovery
- Patchy retrieval
- Lack of authority control
- Difficulty with standard deduplication
- Burden on staff time
- Ties us to multiple database model

15 Aims
- Better records
- Fewer records

16 Existing Solutions?
- Manual recataloguing
- Commercial solutions
- Universal catalogue
- Discovery layer
These either don’t solve the core problem or are expensive and/or time-consuming

17 Our solution
Automated Cataloguing Tools!
- Short record enrichment
- Automated MARC correction
- Deduplication
Order is important: full, well-coded records are easier to deduplicate

18 General principles
- Retrieve some records from a Voyager database
- Examine and/or manipulate them
- If necessary, make changes in the database
N.B. Watch indexes and table space!

19 General tools
- Perl – holds everything together
- Perl DBI – connects to databases
- SQL – retrieves records from database
- MARC::Record modules (from CPAN) – to examine/manipulate records
- Pbulkimport/Batchcat – to make changes to the database
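
The retrieve, examine, change loop from slides 18 and 19 can be sketched roughly as follows. This is an illustration only: it is written in Python rather than the Perl used in the talk, an in-memory SQLite table stands in for the Voyager database, and the table name `bib`, its columns, and the `max_fields` cut-off are all hypothetical.

```python
import sqlite3

# Stand-in for the Voyager database: an in-memory SQLite table.
# The table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bib (bib_id INTEGER PRIMARY KEY, title TEXT, "
    "field_count INTEGER, needs_upgrade INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO bib (bib_id, title, field_count) VALUES (?, ?, ?)",
    [
        (1, "How to build a habitable planet", 12),
        (2, "Learning Perl", 4),
        (3, "Programming the Perl DBI", 3),
    ],
)


def short_record_ids(conn, max_fields=5):
    """Retrieve records with SQL, then examine each one in program code."""
    rows = conn.execute("SELECT bib_id, field_count FROM bib ORDER BY bib_id")
    return [bib_id for bib_id, field_count in rows if field_count <= max_fields]


def flag_for_upgrade(conn, bib_ids):
    """Write changes back only for the records that need them."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "UPDATE bib SET needs_upgrade = 1 WHERE bib_id = ?",
            [(bib_id,) for bib_id in bib_ids],
        )


ids = short_record_ids(conn)
flag_for_upgrade(conn, ids)
```

Keeping the read step and the write step in separate functions mirrors the slide's "if necessary" wording: records are examined first, and the database is touched only for the ids that come back.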

20 Batchcat vs Pbulkimport
Batchcat
- Installed on PC with Voyager
- More versatile
- Can’t be used on server
Pbulkimport
- Limited functionality
- Needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN)
- Can be used on server

21 Books
- Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: 0596001320
- Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN: 1565926994

22 Enriching short records How to get from this …

23 to this

24 Basic mechanism
- Take short record
- Find a matching full record
- Overlay short record with full record
Need a source of full records
In Cambridge: University Library, with a large database of full, authority controlled records

25
1. File of SHORT RECORD bib ids
2. Connects to LOCAL database and checks if a valid bib id
3. Retrieves SHORT RECORD info from local database
4. Connects to EXTERNAL source. Finds best FULL RECORD match and scores it
5. Compares match score to overlay threshold. If OK, retrieves MARC record for FULL RECORD
6. Corrects FULL MARC record. Removes inappropriate fields. Inserts fields to be retained from SHORT RECORD
7. In local database overlays SHORT RECORD with FULL RECORD
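
The scoring, threshold and overlay steps in this workflow might look something like the sketch below. It is a Python illustration, not the Perl program described in the talk. The scoring rule (the share of ISBN, title and year that agree), the threshold value, and the lists of tags to retain or drop are all assumptions made for the example; the slides do not say how the real score is calculated.

```python
OVERLAY_THRESHOLD = 0.8                # hypothetical cut-off
RETAIN_FROM_SHORT = {"001", "852"}     # hypothetical: local fields kept from the short record
DROP_FROM_FULL = {"001", "852"}        # hypothetical: fields never copied from the full record


def normalise(text):
    """Lower-case and keep only letters and digits, so punctuation is ignored."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def match_score(short, full):
    """Share of the short record's match points that agree with the full record."""
    present = [key for key in ("isbn", "title", "year") if short.get(key)]
    if not present:
        return 0.0
    hits = sum(
        1 for key in present
        if normalise(short[key]) == normalise(full.get(key, ""))
    )
    return hits / len(present)


def overlay(short_fields, full_fields):
    """Build the new record: full record body plus retained short-record fields."""
    kept = [f for f in short_fields if f[0] in RETAIN_FROM_SHORT]
    body = [f for f in full_fields if f[0] not in DROP_FROM_FULL]
    return sorted(kept + body, key=lambda field: field[0])


short = {"isbn": "9780961751111", "title": "How to build a habitable planet", "year": "1985"}
full = {"isbn": "9780961751111", "title": "How to build a habitable planet /", "year": "1985"}

if match_score(short, full) >= OVERLAY_THRESHOLD:
    new_record = overlay(
        [("001", "662002"), ("245", "How to build a habitable planet")],
        [("001", "99"), ("245", "How to build a habitable planet /"), ("650", "Astronomy.")],
    )
```

Here the short record's own 001 survives the overlay, so the upgraded record keeps its local bib id while taking its descriptive fields from the full record.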

26 Output

27 Interface

28 Results
- Service has been running for 1 year (much of which was testing)
- 18 libraries subscribed to use service
- 90,000 short records upgraded

29 MARC checking and correction
- Bibliographic standard – agreed minimum standard for cataloguing
- Every week, libraries receive an automatically generated file of MARC coding errors for correction
- Based on MARC::Lint module with many alterations

30 Output

31 Mechanism
- Connects to database using Perl DBI
- Retrieves MARC record for records created/edited in last week
- Runs them through MARC check
- Prints errors to file
- Emails file to library
Over 100,000 errors pointed out so far!
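
To give a feel for the "runs them through MARC check" step, here is a small Python sketch that parses mnemonic MARC lines and reports punctuation errors. It is not MARC::Lint and not the Cambridge module: the three checks and the message wording are assumptions chosen to match the kinds of problem shown later in the talk, and the database and email steps are left out.

```python
def parse_field(line):
    """Split a mnemonic line such as '=245 10$aTitle /$cby X.' into its parts."""
    tag, _, data = line[1:].partition(" ")
    indicators, body = data[:2], data[2:]
    subfields = [(chunk[0], chunk[1:]) for chunk in body.split("$") if chunk]
    return tag, indicators, subfields


def check_field(tag, subfields):
    """Return error messages for one data field (three illustrative rules only)."""
    errors = []
    for (_, value), (next_code, _) in zip(subfields, subfields[1:]):
        if tag == "245" and next_code == "c" and not value.endswith(" /"):
            errors.append("245: space forward slash expected before subfield c")
        if tag == "260" and next_code == "b" and not value.endswith(" :"):
            errors.append("260: space colon expected before subfield b")
    if tag == "260" and subfields and not subfields[-1][1].endswith("."):
        errors.append("260: full stop expected at end of field")
    return errors


def check_record(lines):
    """Check every data field; the leader and control fields (00X) are skipped."""
    report = []
    for line in lines:
        tag, _, subfields = parse_field(line)
        if tag.isdigit() and tag >= "010":
            report.extend(check_field(tag, subfields))
    return report


record = [
    r"=LDR 00472nam\\2200157\a\4500",
    r"=001 662002",
    r"=245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.",
    r"=260 \\$aNew York ;$bEldigio Press,$cc1985",
]
report = check_record(record)
```

In a weekly job, a list like `report` would be written to a file per library and emailed, as the slide describes.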

32 MARC Correction
How to get from this …
=LDR 00472nam\\2200157\a\4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W.S.,$d1931-
=245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
=260 \\$aNew York ;$bEldigio Press,$cc1985
=300 \\$a291p $bill $c23cm
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.

33 to this!
=LDR 00453nam 2200157 a 4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W. S.,$d1931-
=245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
=260 \\$aNew York :$bEldigio Press,$cc1985.
=300 \\$a291 p. :$bill. ;$c23 cm.
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.

34 MARC Correction
- Version of module which, where there is no ambiguity, corrects errors
- Built into short record upgrade program
- Also offered as a retrospective service to clean up legacy records
- Possibility of building it into weekly check

35 Mechanism
- Connects to database using Perl DBI
- Retrieves full MARC record
- Runs against correction module
- Replaces corrected record in database

36 Output
Bib id: 662002
How to build a habitable planet ; By Wallace S. Broecker.
100: UPDATE: Spaces inserted between initials in subfield _a
245: UPDATE: By uncapitalised at start of subfield c
245: UPDATE: Space forward slash inserted before subfield _c
260: UPDATE: Full stop inserted at end of field
260: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Full stop inserted after the p in pagination
300: UPDATE: Full stop inserted at end of field
300: UPDATE: Illustration abbreviation has been corrected
300: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Space inserted between digits and cm
300: UPDATE: Space inserted between digits and p in pagination
300: UPDATE: Space semi-colon inserted before subfield c
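
As a worked example of unambiguous correction, the Python sketch below repairs the 300 field from the sample record on slides 32 and 33. It is an illustration under stated assumptions, not the correction module from the talk: it handles only non-repeated subfields a, b and c, the regular expressions are my own, and the update messages are simplified compared with the real output above.

```python
import re


def correct_300(field):
    """Return (corrected_field, updates) for a 300 field in mnemonic form."""
    indicators, body = field[:2], field[2:]
    # Strip whitespace and any existing ISBD separator from each subfield value.
    subs = {
        chunk[0]: chunk[1:].strip().rstrip(":;").strip()
        for chunk in body.split("$") if chunk
    }
    updates = []

    def fix(code, pattern, replacement, message):
        if code in subs:
            new_value = re.sub(pattern, replacement, subs[code])
            if new_value != subs[code]:
                subs[code] = new_value
                updates.append("300: UPDATE: " + message)

    fix("a", r"(\d+)\s*p\.?$", r"\1 p.", "Pagination normalised in subfield a")
    fix("b", r"^ill\.?$", "ill.", "Illustration abbreviation normalised in subfield b")
    fix("c", r"(\d+)\s*cm\.?$", r"\1 cm.", "Dimensions normalised in subfield c")

    # Re-insert the ISBD punctuation that precedes subfields b and c.
    separators = {"a": "", "b": " :", "c": " ;"}
    corrected = indicators
    for code in "abc":
        if code in subs:
            corrected += separators[code] + "$" + code + subs[code]
    return corrected, updates


corrected, updates = correct_300(r"\\$a291p $bill $c23cm")
```

Running the function a second time on its own output produces no further updates, which is a useful property for a tool that may be run repeatedly over legacy records.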

37 Results
- In testing, 70,000 records processed
- Corrected over 200,000 MARC coding errors
- May run ALL our existing records through at some stage

38 Deduplication – in progress!
Three stages:
- Identification of groups of duplicates
- Identification/construction of ‘best’ record
- Deletion of other records – relinking of holdings/items/Purchase Orders to ‘best record’

39 Identification of duplicates
- Connect to a database with Perl DBI
- Use SQL to retrieve records
- For each record, retrieve all available data from tables
- Use matching algorithm to identify groups of duplicates
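
The slides do not describe the matching algorithm itself. One simple way to group duplicates, shown here purely as an assumption, is to reduce each record to a normalised match key and collect records that share a key. The sketch is in Python for illustration, and the choice of title plus year as the key is hypothetical.

```python
import re
from collections import defaultdict


def match_key(record):
    """Normalised title plus year: punctuation and case differences are ignored."""
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return (title, record.get("year", ""))


def duplicate_groups(records):
    """Return lists of bib ids that share a match key (groups of two or more)."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["bib_id"])
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


records = [
    {"bib_id": 1, "title": "How to build a habitable planet /", "year": "1985"},
    {"bib_id": 2, "title": "How to Build a Habitable Planet", "year": "1985"},
    {"bib_id": 3, "title": "Learning Perl", "year": "2001"},
]
groups = duplicate_groups(records)
```

A key this crude would merge different editions published in the same year, which is one reason the talk stresses that full, well coded records make deduplication easier: more fields are available to tell near-matches apart.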

40 And you’ll end up with something like this:

41 Identification of best record
- For each group of duplicates, MARC records retrieved
- Passed to scoring algorithm
- Record with highest score forms basis of ‘best’ record
- Retains set fields (e.g. subject headings) from ‘other’ records
- Corrects any MARC coding errors
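
A rough Python sketch of choosing and building the ‘best’ record follows. The weights, the rule of summing them over the fields present, and the decision to retain only 650 fields from the other records are all assumptions for illustration; the actual scoring algorithm is not given in the slides.

```python
WEIGHTS = {"100": 2, "245": 3, "260": 2, "300": 1, "650": 2}  # hypothetical weights
RETAINED_TAGS = {"650"}                                        # hypothetical: subject headings


def score(record):
    """Fuller records score higher: sum the weight of every field present."""
    return sum(WEIGHTS.get(tag, 0) for tag, _ in record["fields"])


def build_best(group):
    """Take the highest-scoring record and add retained fields from the others."""
    best = max(group, key=score)
    fields = list(best["fields"])
    for other in group:
        if other is best:
            continue
        for field in other["fields"]:
            if field[0] in RETAINED_TAGS and field not in fields:
                fields.append(field)
    return {"bib_id": best["bib_id"], "fields": sorted(fields, key=lambda f: f[0])}


group = [
    {"bib_id": 1, "fields": [("245", "How to build a habitable planet"),
                             ("650", "Astronomy.")]},
    {"bib_id": 2, "fields": [("100", "Broecker, W. S."),
                             ("245", "How to build a habitable planet /"),
                             ("260", "New York : Eldigio Press, c1985."),
                             ("650", "Astrophysics.")]},
]
best = build_best(group)
```

The merged record keeps the bib id of the highest-scoring record, so the remaining records in the group are the ones that would need deleting and relinking.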

46 But …
- No relinking functionality, even in BatchCat
- No viable workaround for libraries using Acquisitions, or without losing circulation history

47 In conclusion …
- Tools for librarians, not replacements!
- Do the stuff programs do well, allowing humans to concentrate on what humans do well
- Won’t do all the work, but make a solution to major data problems feasible

48 Questions?

